
General Notes, Dr. Ralf-Detlef Kutsche, Academic Director

Firstly, I would like to mention that in my position as "Academic Director" at DIMA, I am responsible, on behalf of Prof. Markl, for coordinating all Master's and Bachelor's theses at the DIMA research group, which means:

  • The formal procedure of applying for and registering a thesis (with a proposal prepared by you, the candidate, together with your potential advisor) must come across my desk. Comments, additional advice, etc. will reach you in due time.
  • In the case of an industrial collaboration, there must be a meeting with me in order to clarify the rules and conditions we impose at DIMA and at Prof. Markl's DFKI/IAM department.
  • The final defense of each thesis takes place in our DIMA MSc/BSc colloquium (typically one Friday afternoon per month) under my direction.
  • In all these cases, please contact me directly. In general, this will be after you have reached an agreement with your chosen advisor from the DIMA/DFKI groups. If you are completely lost in "idea space", you can ask for an appointment to get general guidance!

Contact:

  • by email (ralf-detlef.kutsche@tu-berlin.de), or
  • in my office hours (EN-726, Tuesday, 12-13, by appointment).

Secondly, from time to time I also act as the advisor of theses closely related to my research areas; candidates should preferably have been students of mine in INFMOD (Advanced Information Modeling, a late Bachelor's or early Master's class, depending on your course of study) or in the very advanced Master's class AIM-1 / HDIS (Advanced Information Management 1 – Heterogeneous and Distributed Information Systems).

Researchers and Thesis Opportunities

Dr. Kaustubh Beedkar

kaustubh.beedkar@tu-berlin.de

Research Area: "Geo-Distributed Data Analysis"

Topic Area: Constraint-aware Query Processing for Geo-Distributed Data Analysis

Many large organizations today have a global footprint and operate data centers that produce large amounts of data at different locations around the globe. Analyzing such geographically distributed data as a whole is essential to derive valuable insights. Typically, geo-distributed data analysis is carried out either by first communicating all data to a central location where analytics is performed, or by a distributed execution strategy that minimizes data communication. However, legal constraints arising from regulations pertaining to data sovereignty and data movement (e.g., prohibition of the transfer of certain data across national borders) pose serious limitations to existing approaches. In this context, our research explores:

  1. various possibilities for declaratively specifying legal constraints and
  2. methods and algorithms to automatically derive distributed execution strategies under such constraints.
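
To make this concrete, here is a minimal, hypothetical sketch of how a shipping constraint could be declared and then used to filter the sites at which an operator may legally execute. All names, the constraint format, and the plan representation are invented for illustration and are not part of any existing system.

    # Hypothetical sketch: declare data-movement constraints and filter the
    # candidate execution sites for an operator accordingly.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ShippingConstraint:
        dataset: str                   # logical dataset the rule applies to
        allowed_regions: frozenset     # regions this dataset may be shipped to

    CONSTRAINTS = [
        ShippingConstraint("customers_eu", frozenset({"eu-west", "eu-central"})),
        ShippingConstraint("clickstream_us", frozenset({"us-east", "eu-west"})),
    ]

    def legal_sites(datasets, candidate_sites):
        """Return the sites to which *all* input datasets may legally be shipped."""
        sites = set(candidate_sites)
        for c in CONSTRAINTS:
            if c.dataset in datasets:
                sites &= c.allowed_regions
        return sites

    # A join over both datasets may only run where both are allowed to travel.
    print(legal_sites({"customers_eu", "clickstream_us"},
                      {"us-east", "eu-west", "ap-south"}))   # -> {'eu-west'}

A real planner would apply checks of this kind while enumerating distributed execution strategies.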

Please arrange a meeting with me to discuss concrete thesis opportunities. Students are also encouraged to propose their own topic within the scope of the above research problem.


Prerequisites: Strong programming skills (preferably in Java); knowledge of query planning and execution in DBMSs.

Nice to have: IDB-PRA, DBT, or other database lab courses and seminars.

Dr. Alexander Borusan

alexander.borusan@tu-berlin.de

Research Area: „Data Stream Management in Embedded Information Systems“

Topic Area: Data Stream Modeling and Processing

Typical applications of embedded information systems (automotive, avionics, manufacturing control) include two main tasks: monitoring and controlling. Many sources in such applications produce data continuously as a stream. A data stream is an ordered sequence of data that can be read only once and should be processed in real time. From the point of view of data stream management, several tasks need to be solved: modeling (data stream models), processing (data structuring and data reduction), querying (types of queries), scheduling, and storage. Additionally, data stream analysis has become one of the most important tasks over the last decade.


Data Stream Modeling:

Architectures and models of real-time streaming in embedded information systems (automotive, avionics)


Data Stream Processing:

Taxonomy and comparison of data reduction techniques for data streaming in automotive applications


Dr. Sebastian Breß

sebastian.bress@dfki.de

Research Area: „Data Management on Modern Hardware“

Topic Area: Data Management on Modern Hardware

Modern hardware influences data management in virtually every aspect. With main memory sizes growing to the terabyte scale, we can keep databases in memory. This shifts the performance bottleneck from disk IO to main memory access and computation. This has a number of consequences: database operators need to be tuned to use caches efficiently and must be multi-threaded to make efficient use of modern processors, including new processor architectures such as GPUs or MICs. Query execution needs to be made CPU- and cache-efficient. Transaction processing can be optimized using hardware transactions and non-volatile memory.

We also encourage students to propose their own topics in the field of data management on modern hardware.

Prerequisites: Strong programming skills in C/C++, deep knowledge of database implementation techniques
Nice to have: Knowledge of LLVM, CUDA, or OpenCL.

 

Detailed topic proposals:

Collaborative Query Processing on Heterogeneous Processors

Here, you would extend CoGaDB and evaluate the performance impact of collaborative query processing. The most important related work is the Morsel paper (it proposes a strategy to make a database NUMA-aware, but the same idea can be applied to heterogeneous processors with dedicated memory).

 

Prototype a High Performance Stream Processing System

Hardcode 4-5 streaming queries in C and compare their performance to Apache Flink and Apache Storm. The goal is to find out whether it is beneficial to write a specialized code generator for streaming systems. If so, prototype a simple code generator that supports some simple streaming queries.

Bonaventura Del Monte

delmonte@tu-berlin.de

Research Area: „State Management for Distributed Data Stream Processing“

Topic Area: Improvement of the End-to-End Management of Large State in Distributed Stream Processing Engines

I offer Master's theses related to my research area, which seeks to enhance the end-to-end management of large state in distributed stream processing engines. Resource elasticity, fault tolerance, load balancing, robust execution of stateful queries, and query plan optimization are first-class citizens of my research agenda.

If you are interested in a topic in this area, please provide me with some information about you, your interests, your programming skills, and your CV.

 

Detailed available topics:

End-to-End Management of Large Scale Streaming Experiments.

Executing distributed experiments is a tedious, error-prone process, especially when stream processing engines (SPEs) are involved. When benchmarking the behavior of an SPE, researchers are always after a number of metrics, which come in the form of time series. In a distributed setup, the number of metrics grows linearly with the number of nodes. Furthermore, faulty behavior might show up at some point during the experiment. The goal of this thesis is to design and develop a framework that allows researchers to define and run large-scale experiments involving a few systems (e.g., SPEs, data generators, and a number of third-party systems), gather all the metrics the users are interested in from those systems, and provide a GUI that helps them analyze the results.

Requisites: Strong programming skills in Java (and C++), a good understanding of the JVM and its memory model, and good knowledge of the Apache Flink APIs

Nice to have: good knowledge in one (or more) of the following topics: Apache Flink internals, network programming, and distributed systems

Gabor Gévay

gevay@tu-berlin.de

Research Area: „Programming Models in Data Analytics“

Topic Area: Compiling Datalog Programs to Apache Flink

Datalog is a highly declarative database query language, allowing users to express queries in a concise way and leaving the details of execution to the system. There are several works [1,2,3,4,5,6,7] aiming to execute Datalog queries on large amounts of data by using distributed dataflow systems such as Apache Spark. Building on this line of work, the aim of this thesis is to execute Datalog queries on Apache Flink. Flink is a state-of-the-art dataflow system supporting cyclic dataflows [8], which makes iterative computations more efficient. Therefore, Flink would be well-suited to implement the so-called semi-naive bottom-up strategy [9] of executing Datalog queries.

The closest work and state of the art is probably [5]. The Flink implementation could have better performance, and it could even be simpler by relying on Flink's incremental iterations. Several complications that arise in [5], such as dealing with the overhead of launching many dataflow jobs, are automatically solved by Flink's cyclic dataflows, whereby the entire program is executed as a single (cyclic) dataflow job.
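
As a purely algorithmic illustration of the semi-naive bottom-up strategy [9], independent of Flink, the following sketch computes the transitive closure of an edge relation; a Flink implementation would express the per-round delta via Flink's incremental (delta) iterations. The relation contents are chosen only for this example.

    # Semi-naive bottom-up evaluation of:
    #   path(X, Y) :- edge(X, Y).
    #   path(X, Z) :- path(X, Y), edge(Y, Z).
    # Each round joins only the newly derived tuples (the "delta") with edge.
    edges = {(1, 2), (2, 3), (3, 4)}

    path = set(edges)       # all facts derived so far
    delta = set(edges)      # facts derived in the previous round
    while delta:
        derived = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = derived - path          # keep only genuinely new facts
        path |= delta

    print(sorted(path))     # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]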

 

[1] Afrati, F. N., Borkar, V., Carey, M., Polyzotis, N., & Ullman, J. D. (2011, March). Map-reduce extensions and recursive queries. In Proceedings of the 14th international conference on extending database technology (pp. 1-8). ACM.

[2] Shaw, M., Koutris, P., Howe, B., & Suciu, D. (2012, September). Optimizing large-scale Semi-Naïve datalog evaluation in hadoop. In International Datalog 2.0 Workshop (pp. 165-176). Springer, Berlin, Heidelberg.

[3] Bu, Y., Borkar, V., Carey, M. J., Rosen, J., Polyzotis, N., Condie, T., ... & Ramakrishnan, R. (2012). Scaling datalog for machine learning on big data. arXiv preprint arXiv:1203.0160.

[4] Wang, J., Balazinska, M., & Halperin, D. (2015). Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines. Proceedings of the VLDB Endowment, 8(12), 1542-1553.

[5] Shkapsky, A., Yang, M., Interlandi, M., Chiu, H., Condie, T., & Zaniolo, C. (2016, June). Big data analytics with datalog queries on Spark. In Proceedings of the 2016 International Conference on Management of Data (pp. 1135-1149). ACM.

[6] Wu, H., Liu, J., Wang, T., Ye, D., Wei, J., & Zhong, H. (2016, November). Parallel materialization of datalog programs with Spark for scalable reasoning. In International Conference on Web Information Systems Engineering (pp. 363-379). Springer, Cham.

[7] Rogala, M., Hidders, J., & Sroka, J. (2016, June). DatalogRA: datalog with recursive aggregation in the Spark RDD model. In Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems (p. 3). ACM.

[8] Ewen, S., Tzoumas, K., Kaufmann, M., & Markl, V. (2012). Spinning fast iterative data flows. Proceedings of the VLDB Endowment, 5(11), 1268-1279.

[9] Ceri, S., Gottlob, G., & Tanca, L. (1989). What you always wanted to know about Datalog (and never dared to ask). IEEE transactions on knowledge and data engineering, 1(1), 146-166.

Dr. Holmer Hemsen

holmer.hemsen@dfki.de

Research Area: „Scalable Signal Processing / Industrie 4.0“

Topic Areas: Data Analytics of Massive Time Series, Intelligent and Scalable Resource Management for Industrie 4.0

Data Analytics of Massive Time Series
A time series is a set of observations, each recorded at a specific time. Examples of time series are manifold, e.g. electrocardiography curves, stock market data, seismic measurements, and network load. Time series analysis comprises a wide range of methods, such as anomaly and outlier detection, forecasting, and pattern recognition. The focus of this topic area is on researching methods for the analysis of massive and/or multi-dimensional time series data.
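
As a toy illustration of one such method, the sketch below flags points that deviate strongly from the mean of a preceding sliding window (a simple z-score test). It only illustrates the flavor of the problem; the actual thesis work targets massive and multi-dimensional data.

    # Flag points that deviate strongly from the mean of the preceding window.
    from statistics import mean, stdev

    def zscore_outliers(series, window=20, threshold=3.0):
        outliers = []
        for i in range(window, len(series)):
            hist = series[i - window:i]
            mu, sigma = mean(hist), stdev(hist)
            if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
                outliers.append(i)
        return outliers

    data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10] * 3 + [40] + [10, 11, 9, 10] * 5
    print(zscore_outliers(data))   # flags index 30, the injected spike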

Intelligent and Scalable Resource Management for Industrie 4.0
The goal of Industrie 4.0 is to digitalize, automate, and optimize industrial production systems. In many cases this involves upgrading conventional production systems to cyber-physical systems, often by utilizing Internet of Things (IoT) technology. The focus of this research topic is on methods for the scalable optimization of production lines and the intelligent forecasting of consumable resources to calculate optimal dynamic maintenance strategies.

Prerequisites: Strong programming skills in Java, Scala, or Python; good writing skills; preferably knowledge of Apache Flink

 

 

Jeyhun Karimov

jeyhun.karimov@dfki.de

Research Area: „Benchmarking & Concurrent Query Processing“

Topic Areas: Benchmarking Data Processing Systems, Concurrent Query Processing

Benchmarking Data Processing Systems

With the development of big data systems in recent years, a variety of benchmarks have been proposed by both industry and academia.

The goal of this research is to develop novel benchmarking methodologies to evaluate and compare workloads on a set of systems, which eventually leads to technology improvements.

I offer topics to benchmark data processing systems, such as graph processing and stream processing systems.

 

Concurrent Query Processing

In the last decade, many distributed data processing engines were developed to perform continuous queries on massive online data. The central design principle behind these engines is to handle queries with a query-at-a-time model, optimizing each query separately. With the adoption of multi-tenant clouds, it is essential to enable new optimization frameworks that share data and computation among the available queries.

I offer topics related to concurrent query optimization (single- or multi-objective).

Martin Kiefer

martin.kiefer@tu-berlin.de

Research Area: „Approximate Data Analysis Using Modern Hardware“

Topic Areas: Data Stream Summarization Using Custom Hardware (FPGAs), Improving Query Optimization Using Modern Hardware

My research investigates the combination of data approximation techniques and modern hardware architectures to increase the efficiency of data analysis.

 

Data Stream Summarization Using Custom Hardware (FPGAs)


Power-efficient data analysis is an increasingly important problem in the era of big data: the amount of available data continues to increase exponentially, and for economic and environmental reasons we need to ensure that the energy required to analyze this data does not grow exponentially as well. I am approaching this problem for data stream analysis by combining the potential of stream summarization techniques and custom hardware on FPGAs.
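
As an example of the kind of summarization technique involved, the following is a tiny software sketch of a Count-Min sketch, which approximates per-item frequencies of a stream in sub-linear space; the thesis work concerns realizing and evaluating such structures on FPGAs, not this Python toy.

    # Minimal Count-Min sketch: approximate per-item frequencies in sub-linear space.
    import random

    class CountMinSketch:
        def __init__(self, width=256, depth=4, seed=42):
            rng = random.Random(seed)
            self.width, self.depth = width, depth
            self.salts = [rng.getrandbits(32) for _ in range(depth)]
            self.table = [[0] * width for _ in range(depth)]

        def add(self, item):
            for row, salt in enumerate(self.salts):
                self.table[row][hash((salt, item)) % self.width] += 1

        def estimate(self, item):
            # Never underestimates; overestimation is bounded by hash collisions.
            return min(self.table[row][hash((salt, item)) % self.width]
                       for row, salt in enumerate(self.salts))

    cms = CountMinSketch()
    for x in ["a"] * 100 + ["b"] * 5:
        cms.add(x)
    print(cms.estimate("a"), cms.estimate("b"))   # roughly 100 and 5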


Improving Query Optimization Using Modern Hardware


The query optimizer is at the heart of state-of-the-art relational database systems. It derives different execution plans for a given query and selects the cheapest one based on statistics and a cost model. However, this has to be done within a very tight time budget, since query optimization delays query execution. I am investigating how modern hardware can help with this task. In particular, I am improving the statistics available to the query optimizer by using bandwidth-optimized kernel density models as learning selectivity estimators on GPUs.
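
As a rough sketch of the underlying idea, the following estimates the selectivity of a range predicate with a Gaussian kernel density estimate over a sample. The bandwidth rule and all names are illustrative assumptions; the actual research uses bandwidth-optimized models executed on GPUs.

    # Estimate P(low <= x <= high) from a sample using a Gaussian KDE.
    import math

    def kde_selectivity(sample, low, high, bandwidth=None):
        n = len(sample)
        if bandwidth is None:
            mu = sum(sample) / n
            sd = math.sqrt(sum((x - mu) ** 2 for x in sample) / n) or 1.0
            bandwidth = 1.06 * sd * n ** (-1 / 5)    # Silverman's rule of thumb

        def cdf(x):                                  # standard normal CDF
            return 0.5 * (1 + math.erf(x / math.sqrt(2)))

        # Each sample point contributes the mass of its kernel inside [low, high].
        return sum(cdf((high - xi) / bandwidth) - cdf((low - xi) / bandwidth)
                   for xi in sample) / n

    sample = [i % 100 for i in range(1000)]          # roughly uniform on [0, 100)
    print(kde_selectivity(sample, 10, 20))           # close to 0.1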


Requirements

I offer thesis topics based on current research questions, student interests, and student skills. Students with an interest in modern hardware are preferred, but I may also provide thesis topics without a hardware focus. Skills in C/C++, OpenCL, VHDL, or Python programming might be useful.


If you are interested in my research topics, we can arrange a meeting for a discussion. Please include a CV in your request.

Andreas M. Kunft

andreas.kunft@tu-berlin.de

Research Area: „Mixed Linear and Relational Algebra Pipelines“

Topic Area: Deeply-Embedded DSL with Abstract Data Types (ADT) for (Distributed) Collections and Matrices

Today's data analysis pipelines go beyond pure linear algebra and often include data generation and transformation steps (ETL) that are best defined using relational algebra operators.
In contrast, current systems either provide each domain as a separate library, limiting optimizations to each library's domain in isolation (e.g., Python scikit-learn), or they map operations of the foreign domain on top of their own domain (e.g., Spark, TensorFlow).

I conduct research based on a deeply-embedded DSL with abstract data types (ADTs) for (distributed) collections and matrices. Thus, data analysts can express complete pipelines in a single language. Both ADTs are explicitly reflected in a common intermediate representation (IR), including control flow. Based on this IR, I experiment with new ways of performing holistic optimizations over both ADTs.
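
As a plain illustration of what such a mixed pipeline looks like, the snippet below combines relational-style ETL (selection, projection) with linear algebra (a closed-form ridge regression), written here with NumPy only. It is not the DSL itself, merely the kind of program whose two parts a common IR could optimize holistically.

    # A pipeline that mixes relational-style ETL (filter, project) with linear algebra.
    import numpy as np

    # "Relational" part: filter rows and project feature columns from raw records.
    records = np.array([[1, 25.0, 3.0, 0],
                        [2, 40.0, 1.0, 1],
                        [3, 33.0, 2.5, 1]])
    train = records[records[:, 3] == 1]          # selection: label == 1
    X, y = train[:, 1:3], train[:, 3]            # projection: feature matrix, labels

    # "Linear algebra" part: closed-form ridge regression on the projected data.
    lam = 0.1
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(w)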


Offered Theses:

I offer theses in the described area based on my current topics and the student's interests.

For further information, please contact me via email including your CV, programming skills, and interests.

Dr. Ralf-Detlef Kutsche

ralf-detlef.kutsche@tu-berlin.de

Research Area: „Model Based Software and Data Integration“, focussing on „Semantic Concepts / Semantic Interoperability“

Topic Area: Model Based Methods for the Development of Heterogeneous and Distributed Information Systems in Large Scale

Since the 1980s, there has been a huge discussion about the quality of software and information systems, which today is absolutely a top issue for our modern data world, as we can see from statements like "data is the gold of modern times" in the "big data", "data analytics", and "data science" fields. Prof. Markl and his groups play a fundamental role in these fields worldwide, with the BBDC (Berlin Big Data Center), with the university spin-off Data Artisans (Apache Flink), with several international Master's programmes and tracks such as Erasmus Mundus IT4BI and BDMA, the EIT Digital ICT Innovation Master track "Data Science", and the TUB local Master track "Data Analytics", and with many other activities.

Unfortunately, two Gardner studies from 1985 and 2005 show the same disastrous result: only approx. 25% of the software projects started come to a successful end on time and within budget; another 25% finish after some delay, with budget overruns, and maybe even with reduced functionality and performance. The remaining 50% die on the way, or are never even properly started after some initial planning!

Model-based methods (in earlier times following the MDA (Model Driven Architecture) ideas of the OMG (Object Management Group), an international standardisation and management consortium of almost all large active companies in the world) have for many years promised to improve the quality of software as well as to reduce its cost dramatically (by up to 70%) by applying models to all (!) phases of the whole software process, in our case to the development of (potentially, and in most cases actually) Heterogeneous and Distributed Information Systems (HDIS) at large scale!

Applying models (e.g. UML models in simple cases, but better: 'domain-specific modeling languages') and semantic concepts (e.g. ontologies, formal logic and semantics, metadata standards, and (meta-)thesauri) can support these methods significantly, as the results of two very large industrial R&D projects (among many others) under my scientific guidance show (BiZYCLE, 2007-2010, and BIZWARE, 2010-2013, both funded by the German federal ministry of research, BMBF).

Candidates interested in these topics should have an excellent background in databases and information systems, in software engineering and software architecture, in formal methods and mathematics or theoretical computer science (particularly logic formalisms and languages), and, of course, in modeling with classical modeling languages from the UML family, with E/R (known from every DBS course anyway), with BPMN or any other process/workflow modeling language, and should be interested in application domains (such as health care, my main application area for 30 years, the automotive industry, business intelligence, and, very relevant for the future, the energy sector!).

In case you fulfill these requirements, and you have participated in my classes or can prove knowledge in these fields gained at other universities, please apply for a thesis under my guidance in my office hours (Tue, 12-13, during the semester, by appointment).

Clemens Lutz

clemens.lutz@dfki.de

Research Area: „Fast & Scalable Data Processing on Modern Hardware“

Topic Area: Fast & Scalable Data Processing on Modern Hardware

GPUs and other modern processors are capable of very fast data processing. For example, a high-end Nvidia GPU can read at 900 GB/s from its built-in memory and can compute 14 trillion floating-point operations per second (i.e., 14 TFLOPS). This is more than 20 times faster than regular CPUs.

Our aim is to use this processing power to analyze data. This goal opens the door to many research challenges. For example, large data sets do not fit into the GPU’s on-board memory. How can we efficiently access data from a GPU? How can we make a SQL JOIN operator run fast?

Contact me via E-mail if you like writing fast code and are interested in programming GPUs.

Theses are available for Bachelor's and Master's students. Include a short text about your skills and research interests, and attach your CV. I encourage you to propose your own thesis topic.

 

 

Dr. Alireza Mahdiraji

alireza.rm@dfki.de

Research Area: „Approximate Query Processing on Data Streams“

Topic Area: Distributed Summarization Data Structures

Many real-world applications (e.g., traffic monitoring, cluster health monitoring, web log analysis, online services) generate data streams at an unprecedented rate and volume. Traditional query processing over such massive amounts of streaming data often results in high latency and increased computational cost. This overhead is even more pronounced for query processing over distributed data streams. On the other hand, Approximate Query Processing (AQP) provides approximate answers to queries at a fraction of the cost of the original queries and is a means to achieve interactive response times (sub-second latencies) when faced with voluminous data. Interactive query response times (at the cost of accuracy) are useful for many tasks such as exploratory analytics, big-data visualization, or trend analysis. In particular, AQP techniques utilize data synopses (or summaries), much smaller representations of the data, to quickly answer queries at the cost of accuracy. Examples of such synopses are samples, histograms, wavelets, and sketches.

Our research focuses on developing methods for efficient construction and maintenance of data synopses for large amounts of streaming data that is generated in a distributed fashion.
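
As a small example of one classic synopsis, the following sketch shows reservoir sampling, which maintains a uniform random sample over a stream of unknown length in a single pass; it is for illustration only.

    # Reservoir sampling: keep a uniform sample of k items from a stream in one pass.
    import random

    def reservoir_sample(stream, k, seed=0):
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = rng.randint(0, i)           # replace with probability k / (i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    sample = reservoir_sample(range(1_000_000), k=100)
    # An approximate answer, e.g. an estimated mean, can now be computed on the sample.
    print(sum(sample) / len(sample))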

Diogo Telmo Neves

 diogo_telmo.neves@dfki.de

Research Area: „Data Management“

Topic Area: Synthetic Data

Title: Synthetic Data Generator from Datasets, Structural and Statistical Metadata, and Domain Knowledge.


Description: Due to legal, ethical, privacy, or other issues, getting access to datasets is extremely difficult, if not impossible. This is a very common bottleneck that impairs data analytics in many domains (e.g. the healthcare domain). The aim of the research in this project is to develop a synthetic data generator. The generation of synthetic data should be guided by metadata as well as by domain knowledge, and not just by statistical methods and metrics that can be derived from raw data. To be flexible enough, the generator should have a set of parameters and hyperparameters that allow, for instance, the generation of data points that would be classified as abnormal or as outliers.


Expected Outcome: A (Python) library that allows generating synthetic data from datasets, structural and statistical metadata, and domain knowledge.
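
A minimal sketch of what the core of such a generator could look like is shown below, with per-column statistical metadata given as simple dicts and an outlier fraction as one hyperparameter. All names and the metadata format are illustrative assumptions, not an existing API.

    # Generate synthetic rows from per-column statistical metadata, with a tunable
    # fraction of points that are deliberately placed far outside the normal range.
    import random

    def generate(metadata, n_rows, outlier_fraction=0.01, seed=0):
        rng = random.Random(seed)
        rows = []
        for _ in range(n_rows):
            row = {}
            for col, meta in metadata.items():
                value = rng.gauss(meta["mean"], meta["std"])
                if rng.random() < outlier_fraction:
                    value += 10 * meta["std"] * rng.choice([-1, 1])   # force an outlier
                row[col] = value
            rows.append(row)
        return rows

    metadata = {"age": {"mean": 40, "std": 12}, "bmi": {"mean": 26, "std": 4}}
    print(generate(metadata, 3))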


Requisites:
• Strong skills on using libraries such as scikit-learn and Pandas.
• Strong knowledge of procedural, functional, and object-oriented programming paradigms.
• Strong programming skills in at least one of the following programming languages: Python, Java, and C++.
• Strong knowledge about algorithmic complexity.


Nice to Have:
• Good knowledge on statistics.
• Good knowledge of machine learning algorithms and techniques (e.g. GAN).
• Good knowledge of RDBMS as well as graph databases (e.g. Neo4j).
• Good knowledge on concurrent, parallel, and distributed programming.


If this topic sounds interesting to you, please send me an email with a few lines about you, your background, your research interests, and your programming skills, and attach your CV.

Topic Area: Data Quality

Title: Automated Improvement of Data Quality by Incorporating Structural and Statistical Metadata, and Domain Knowledge.


Description: Often, datasets lack (data) quality, which impairs the accuracy of machine learning algorithms and makes it harder, if not impossible, to implement a prediction model that is able to generalize well on unseen data. Thus, poor data quality has farther-reaching consequences than a data scientist can initially foresee. The aim of the research in this project is to investigate whether it is possible to automatically improve the quality of datasets by incorporating structural and statistical metadata and domain knowledge as a means to detect data issues and report on them, then apply data transformations to automatically fix the previously detected issues, while maintaining a data lineage that allows one to understand and explain every data transformation that has been applied to the original dataset.


Expected Outcome: A (Python) library that automates the improvement of data quality.
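
A minimal sketch of the detect-fix-lineage idea under very simple assumptions (valid ranges as the only metadata, clipping as the only fix; all names are invented for illustration):

    # Detect out-of-range values using statistical metadata, fix them, and record lineage.
    def clean(rows, ranges):
        lineage = []
        for i, row in enumerate(rows):
            for col, (lo, hi) in ranges.items():
                if not lo <= row[col] <= hi:
                    fixed = min(max(row[col], lo), hi)     # clip into the valid range
                    lineage.append((i, col, row[col], fixed, "clipped to metadata range"))
                    row[col] = fixed
        return rows, lineage

    rows = [{"age": 34}, {"age": -2}, {"age": 420}]
    cleaned, lineage = clean(rows, {"age": (0, 120)})
    print(cleaned)
    print(lineage)   # every transformation is explainable from the lineage log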


Requisites:
• Strong skills on using libraries such as scikit-learn and Pandas.
• Strong knowledge of procedural, functional, and object-oriented programming paradigms.
• Strong programming skills in at least one of the following programming languages: Python, Java, and C++.
• Strong knowledge about algorithmic complexity.


Nice to Have:
• Good knowledge on statistics.
• Good knowledge of RDBMS as well as graph databases (e.g. Neo4j).
• Good knowledge on concurrent, parallel, and distributed programming.
• Good knowledge in at least one of the following processing engines: Apache Flink, Apache Spark, and Apache Kafka.


If this topic sounds interesting to you, please send me an email with a few lines about you, your background, your research interests, and your programming skills, and attach your CV.

Alexander Renz-Wieland

alexander.renz-wieland@tu-berlin.de

Research Area: „Large-Scale Machine Learning“

Topic Area: Large-Scale Machine Learning

Training machine learning (ML) models on a cluster instead of a single machine increases the amount of available compute and memory, but requires communication among cluster nodes for synchronizing model parameters. For some ML models, this synchronization can become the dominating part of the training process, such that using more computers does not result in the intended speed-up.

To avoid much of this communication, researchers developed algorithms that create and exploit parameter locality. That is, at a given point in time each of the workers updates only a subset of the model parameters. These subsets typically change throughout the training process, i.e., workers update different subsets throughout training. Such algorithms exist for multiple types of ML models. The locality can stem from the training algorithm, the ML model, or the training data.

ML developers typically need to implement such locality-exploiting algorithms from scratch, i.e., they have to know about low-level details of distributed computing. We are developing a system that allows researchers and practitioners to implement such algorithms without detailed knowledge of distributed computing. Our approach is to make the state-of-the-art architecture for distributed ML, so-called parameter servers, usable and efficient for locality-exploiting algorithms.  
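
To illustrate the programming model only, here is a toy pull/push parameter-server interface; this is not our system's API, merely a sketch of the kind of operations such a system exposes to locality-exploiting algorithms.

    # Toy parameter-server interface: workers pull the parameters they need,
    # compute updates on their data partition, and push the updates back.
    class ToyParameterServer:
        def __init__(self, keys):
            self.params = {k: 0.0 for k in keys}

        def pull(self, keys):
            return {k: self.params[k] for k in keys}

        def push(self, updates):
            for k, delta in updates.items():
                self.params[k] += delta

    ps = ToyParameterServer(keys=range(10))

    def worker_step(partition, lr=0.1):
        keys = {k for k, _ in partition}            # this worker touches few parameters
        local = ps.pull(keys)                       # (parameter locality)
        updates = {k: -lr * (local[k] - target) for k, target in partition}
        ps.push(updates)

    worker_step([(0, 1.0), (3, -2.0)])
    print(ps.pull({0, 3}))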

I offer multiple thesis topics related to this line of work. For example, a thesis can work on aspects of the system or apply the system to specific ML models.

If you are interested, don't hesitate to contact me to arrange a meeting. Please provide me with some information about you, your interests, your prior experience, your programming skills, and your CV.

Ideally, you have coursework or other prior experience in machine learning, C++ programming, and/or distributed systems.

Viktor Rosenfeld

viktor.rosenfeld@dfki.de

Research Area: „Adapting Data Processing Code to Different Processor Types Without Manual Tuning“

Topic Area: Data Processing on Heterogeneous Processors

In the last decade, processors have become increasingly diverse, parallelized, and specialized for specific tasks. For example, in addition to multi-core CPUs, there are GPUs, Intel Xeon Phis, and FPGAs. Often, developers have to write program code that is specific to a particular processor to fully exploit its resources.

In my research, I study how a database can automatically adapt its operator code to the processor it is currently running on. In essence, my goal is to write a database system that learns how to rewrite itself until it runs as fast as possible on any given processor. To this end, I work a lot with OpenCL, a programming standard that enables users to run the same program on different types of processors such as CPUs, GPUs, etc.

Prerequisites: Strong programming skills in C/C++, interest in low-level programming, interest in processor architecture.
Nice to have: Knowledge in GPU programming (e.g., OpenCL and/or CUDA); interest in automatic tuning.

I offer to mentor both Bachelor's and Master's theses in the context of data processing on heterogeneous processors. I encourage students to develop their own ideas. The proposals below can be used as a starting point.

Please contact me via email to discuss your ideas for a thesis. Be sure to include a short text about your skills and interests, and attach your CV.

 

Detailed topic proposals:

Evaluation of Hash-Based Grouped Aggregation Algorithms on GPUs

Hash-based grouped aggregation has been studied extensively on multi-core CPUs. In general, one of three algorithms works best, depending on the group cardinality. However, as the cardinality is not always known in advance, there are also algorithms that do not assume prior knowledge and degrade gracefully to large cardinalities.

The goal of this work is to port and adapt these algorithms, which are written for multi-core CPUs, to GPUs using OpenCL as a target language. The algorithms should be integrated into an existing test suite. A thorough evaluation of these algorithms and a comparison with existing algorithms is also part of this thesis.
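
For orientation, the basic single-threaded form of hash-based grouped aggregation is sketched below in plain Python; the thesis itself concerns efficient OpenCL/GPU variants of such algorithms, not this toy.

    # Hash-based grouped aggregation: SELECT key, SUM(val) FROM t GROUP BY key.
    def hash_aggregate(rows):
        groups = {}                       # hash table from group key to running aggregate
        for key, val in rows:
            groups[key] = groups.get(key, 0) + val
        return groups

    rows = [("de", 3), ("fr", 1), ("de", 5), ("it", 2), ("fr", 4)]
    print(hash_aggregate(rows))           # {'de': 8, 'fr': 5, 'it': 2}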

 

Implicit Vectorization on Intel CPUs

Modern CPUs use so-called SIMD instructions to apply the same instruction (such as addition) to multiple data items in a single cycle. Without exploiting SIMD capabilities, the resources of modern CPUs go to waste. However, SIMD instructions are generally not portable from one processor generation to the next.

The Intel OpenCL compiler tries to use SIMD instructions automatically through a process called auto-vectorization. The goal of this work is to take an existing vectorized formulation of data processing operators, and reformulate them in a way that they can be efficiently vectorized by the Intel OpenCL compiler. A thorough evaluation of these operators and a comparison with existing native SIMD versions is also part of this thesis.

Dr. Anne Schwerk

anne.schwerk@dfki.de

Research Area: "Data Management"

Topic Area: Analysis of Electronic Health Record Data

Title: Data processing, mining, and predictions of electronic health record data

 

Description: With advancing digitalization, more and more electronic health records (EHRs) are becoming available, also in Germany. While access to those records is often restricted in Germany, there are some open EHR datasets at the international level.

The aim of this research topic is to apply data mining techniques to openly available EHRs, such as MIMIC-III, and to explore different prediction models, such as k-means, support vector machines, and random forests.

This is an opportunity to learn more about the specific challenges of healthcare data sources, including their management, pre-processing, and analysis.

Depending on the time frame of your thesis, there might be an option to cross-validate findings on real healthcare data.
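
To give a feel for the workflow, here is a minimal scikit-learn sketch on made-up feature vectors; MIMIC-III itself requires credentialed access and substantial pre-processing, which would be part of the thesis.

    # Toy example: cluster patients and fit a prediction model on synthetic features.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))                  # pretend: 5 clinical features per stay
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # pretend: outcome label

    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print("clusters:", np.bincount(clusters), "test accuracy:", model.score(X_test, y_test))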

 

Requisites:

• Strong skills on managing data, e.g. cleaning, integration, modeling.
• Strong programming skills in at least one of the following programming languages: Python, Java, and C++.
• Strong knowledge on statistics.

 

Nice to Have:

• Good knowledge of machine learning algorithms and techniques
• Good knowledge of RDBMS
• Interest in healthcare/medical topics

 

If this topic sounds interesting to you, please send me an email with your CV, a few lines about your background, your research interests, and your programming skills.

Juan Soto

juan.soto@tu-berlin.de

Research Area: „Data Analysis / Data Analytics“

Topic Area: Exploratory Data Analysis, Numerics in Data Analytics

Exploratory Data Analysis

An Analysis of Current Approaches/Solutions for Big Data Problems and Devising Novel Techniques

Numerics in Data Analytics

A Closer Look at Software Quality in Existing Big Data Analytics Libraries: Challenges and Pitfalls.

Jonas Traub

jonas.traub@tu-berlin.de

Research Area: „On-Demand Data Stream Processing“

Topic Area: On-Demand Data Gathering in the Internet of Things

Real-time sensor data enables diverse applications such as smart metering, traffic monitoring, and sports analysis. In the Internet of Things, billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all available data at maximal frequencies to all applications. Therefore, we need to produce and process data streams that are tailored to the data demand of applications. My research goal is to optimize communication costs in the IoT while maintaining the desired accuracy and latency of stream processing jobs.

 

I offer thesis topics based on current research questions, student interests, and student skills.

Please contact me to arrange a meeting for a discussion about concrete thesis opportunities. Please provide me with some information about you, your interests, your programming skills, and your CV.

Dr. Steffen Zeuch

steffen.zeuch@dfki.de

Research Area: „Query Optimization and Execution on Modern CPUs“

Topic Area: Query Optimization and Execution on Modern CPUs

Over the last decades, database systems have been migrated from disk to memory architectures such as RAM, Flash, or NVRAM. Research has shown that this migration fundamentally shifts the performance bottleneck upwards in the memory hierarchy. Whereas disk-based database systems were largely dominated by disk bandwidth and latency, in-memory database systems mainly depend on the efficiency of faster memory components, e.g., RAM, caches, and registers.

With respect to hardware, the clock speed per core has reached a plateau due to physical limitations. This limit caused hardware architects to devote an increasing number of the available on-chip transistors to more processors and larger caches. However, memory access latency has improved much more slowly than memory bandwidth. Nowadays, CPUs process data much faster than it can be transferred from main memory into the caches. This trend creates the so-called Memory Wall, which is the main challenge for modern main-memory database systems.

To counter these challenges and unlock the full potential of the available processing power of modern CPUs for database systems, we propose theses that reduce the impact of the Memory Wall.

We also encourage students to propose their own topics in the field of query optimization and processing on modern CPUs.

Requirements: Strong programming skills in C/C++, deep knowledge of database implementation techniques, good understanding of computer architecture
Optional: Knowledge of LLVM, VTune, MPI, OpenMP

