TU Berlin

General remarks about the application for a thesis topic:

DIMA has an application form, see Thesis Application (coming soon), which you can use to apply for a thesis at DIMA. Please fill in this form properly and submit it!

 

If you are already familiar with the DIMA group, as a student over several semesters or even as a research/teaching assistant, then you already know the DIMA/DFKI people and their topics, and you can contact them directly in parallel to the formal application!

 

For difficult questions (e.g. special study programmes, external partners, etc.), please contact me personally: Dr. Ralf-Detlef Kutsche. In my position as „Academic Director“ at DIMA, I am responsible for coordinating all Master’s and Bachelor’s theses for Prof. Markl and the DIMA/DFKI-IAM research group. In detail, this means:

  • The formal procedure of applying and registering for a thesis (with a proposal coming from you, the candidate, and your potential advisor) will bring your final proposal to my mailbox. In general, this will happen after you have reached an agreement with your chosen advisor from the DIMA and DFKI-IAM groups. Comments, required changes, additional advice, etc. will come back to you from my side in due time.
  • In case of industrial collaboration, there must be a face-to-face meeting with me in order to clarify the rules and conditions we impose at DIMA and the DFKI-IAM department of Prof. Markl.
  • The final defense of each thesis takes place in our DIMA MSc/BSc colloquium (typically one Friday afternoon per month) under my direction.
  • If you are completely lost in „idea space“, you can ask for an appointment with me for more general guidance!

Contact: Dr. Ralf-Detlef Kutsche

  • by email (ralf-detlef.kutsche@tu-berlin.de [1]), or
  • in my office hours (EN-726, Tuesday, 12-13, by appointment).

Researchers and Thesis Opportunities

Dr. Kaustubh Beedkar [2]

kaustubh.beedkar@tu-berlin.de

Research Area: "Geo-Distributed Data Analysis"

Topic Area: Constraint-aware Query Processing for Geo-Distributed Data Analysis

Many large organizations today have a global footprint and operate data centers that produce large amounts of data at different locations around the globe. Analyzing such geographically distributed data as a whole is essential to derive valuable insights. Typically, geo-distributed data analysis is carried out either by first communicating all data to a central location where analytics is performed or by a distributed execution strategy that minimizes data communication. However, legal constraints arising from regulations pertaining to data sovereignty and data movement (e.g., prohibition of the transfer of certain data across national borders) pose serious limitations to existing approaches. In this context, our research explores

  1. various possibilities for declaratively specifying legal constraints and
  2. methods and algorithms to automatically derive distributed execution strategies under such constraints.

Please arrange a meeting with me to discuss concrete thesis opportunities. Students are also encouraged to propose their own topic within the scope of the above research problem.


Prerequisites: Strong programming skills (preferably in Java) and knowledge of query planning and execution in DBMSs.

Nice to have: Completed IDB-PRA, DBT, or other database lab courses and seminars.

Dr. Alexander Borusan

alexander.borusan@tu-berlin.de

Research Area: „Data Streams Management in Embedded Information Systems“

Topic Area: Data Stream Modeling and Processing

Typical applications of embedded information systems (automotive, avionics, manufacturing control) include two main tasks: monitoring and controlling. Many sources in such applications produce data continuously as a stream. A data stream is an ordered sequence of data that can be read only once and should be processed in real time. From the point of view of data stream management, several tasks need to be solved: modelling (data stream models), processing (data structuring and data reduction), querying (types of queries), scheduling, and storage. Additionally, data stream analysis has become one of the important tasks in the last decade.
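
To illustrate the read-once constraint, here is a minimal Python sketch of one-pass processing: it maintains a running mean and variance (Welford's update) without storing the stream. The function name `running_stats` and the sensor values are hypothetical and only serve as an example of the kind of data reduction mentioned above.

```python
from typing import Iterable, Iterator, Tuple


def running_stats(stream: Iterable[float]) -> Iterator[Tuple[int, float, float]]:
    """Yield (count, mean, population variance) after each element.

    Every element is read exactly once and then discarded, so memory use
    stays constant regardless of the stream length.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        yield count, mean, m2 / count


if __name__ == "__main__":
    # Hypothetical sensor readings arriving one at a time.
    readings = iter([2.0, 4.0, 6.0])
    for count, mean, variance in running_stats(readings):
        print(count, mean, variance)
```

Because only three numbers are kept as state, the same pattern could in principle run on a resource-constrained embedded device.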


Data Streams Modeling:

Architectures and models of real time streaming in embedded information systems (automotive, avionics)


Data Streams Processing:

Taxonomy and comparison of data reduction techniques for data streaming in automotive applications


Dr. Sebastian Breß [3]

sebastian.bress@dfki.de

Research Area: „Data Management on Modern Hardware“

Topic Area: Data Management on Modern Hardware

Modern hardware influences data management in virtually every aspect. With main memory sizes growing to the terabyte scale, we can keep databases in memory. This shifts the performance bottleneck from disk I/O to main memory access and computation, which has a number of consequences: Database operators need to be tuned to be cache-efficient and must be multi-threaded to make efficient use of modern processors, including new processor architectures such as GPUs or MICs. Query execution needs to be made CPU- and cache-efficient. Transaction processing can be optimized using hardware transactions and non-volatile memory.

We also encourage students to propose their own topics in the field of data management on modern hardware.

Prerequisites: Strong programming skills in C/C++, deep knowledge of database implementation techniques
Nice to have: Knowledge of LLVM, CUDA, or OpenCL.

 

Detailed Topics proposals:

Collaborative Query Processing on Heterogeneous Processors

Here, you would extend CoGaDB and evaluate the performance impact of collaborative query processing. The most important related work is the Morsel paper (the paper proposes a strategy to make a database NUMA-aware, but the same idea can be applied to heterogeneous processors with dedicated memory).

 

Prototype a High Performance Stream Processing System

Hardcode 4-5 streaming queries in C and compare their performance to Apache Flink and Apache Storm. The goal is to find out whether it is beneficial to write a specialized code generator for streaming systems. If yes, prototype a simple code generator to support some easy streaming queries.

Ankit Chaudhary [4]

ankit.chaudhary@tu-berlin.de

Research Area: „Query Optimization and Operator Placement in Distributed Stream Processing Systems"

Topic Area: Query Optimization in Distributed Stream Processing Systems

The processing of geo-distributed data streams is a key challenge for many Internet of Things (IoT) applications. Cloud-based stream processing engines (SPEs) process data centrally and thus require all data to be present in the cloud before processing. However, this centralized approach becomes a bottleneck for processing data from millions of geo-distributed sensors on a large-scale IoT infrastructure. At DIMA, we are working on a next-generation data management platform for the IoT called NebulaStream (NES) (www.nebula.stream [5]). The NES system is designed to combine the centralized cloud with decentralized fog devices to mitigate the bottlenecks of a centralized cloud infrastructure. One major challenge for an SPE in this unified fog-cloud environment is to execute millions of concurrent queries. NES should be able to optimize and deploy incoming queries at high speed.

In this context, I explore how query optimization and placement for such a large number of queries in a geo-distributed unified fog-cloud environment can be done efficiently. I am further exploring ways to mitigate the effect of transient failures or changes in infrastructure performance on running queries.

If you are interested in building next-generation query optimization algorithms for handling a large and evolving workload in a dynamic environment, please do not hesitate to contact me. Please make sure to provide me with some information about yourself, your interests, your prior experience, your programming skills, and your CV. Ideally, you have coursework or other prior experience in C++ programming and distributed systems (stream processing engines, databases). A good understanding of query optimizers for DBMSs is desirable.

Bonaventura Del Monte [6]

delmonte@tu-berlin.de

Research Area: „State Management for Distributed Data Stream Processing“

Topic Area: Improvement of the End-to-End Management of Large State in Distributed Stream Processing Engines

I offer Master's theses related to my research area, which seeks to enhance the end-to-end management of large state in distributed stream processing engines. Resource elasticity, fault tolerance, load balancing, robust execution of stateful queries, and query plan optimization are first-class citizens of my research agenda.

If you are interested in a topic in this area, please provide me with some information about yourself, your interests, your programming skills, and your CV.

 

Detailed available topics:

End-to-End Management of Large Scale Streaming Experiments.

Executing distributed experiments is a tedious, error-prone process, especially when stream processing engines (SPEs) are involved. When benchmarking the behavior of an SPE, researchers are typically after a number of metrics, which take the form of time series. In a distributed setup, the number of metrics grows linearly with the number of nodes. Furthermore, faulty behavior might show up at some point in the experiment. The goal of this thesis is to design and develop a framework that allows researchers to define and run large-scale experiments involving a few systems (e.g., SPEs, data generators, and a number of third-party systems), gather all the metrics the users are interested in from those systems, and provide a GUI that helps them analyze the results.

Requisites: strong programming skills in Java (and C++), good understanding of the JVM and its memory model, and good knowledge of the Apache Flink APIs

Nice to have: good knowledge in one (or more) of the following topics: Apache Flink internals, network programming, and distributed systems

Behrouz Derakhshan [7]

behrouz.derakhshan@dfki.de

Research Area: "Training and Deployment of Machine Learning Pipelines"

Topic Area: "Efficient Mini-batch Retraining of DNN Pipelines"

Training a machine learning pipeline is the first step in machine learning applications. After the initial training, one must deploy the pipeline and model and make them available for answering prediction queries. The deployed model must be constantly retrained with newly available training data to ensure that the model quality does not degrade.

Recent work [1] shows that full retraining is not always required. [1] proposes to continuously train the deployed model using samples of the historical preprocessed data. Their approach achieves similar quality while reducing the total data processing and model training time by an order of magnitude.

In this thesis, we would like to investigate the applicability of the approach in [1] to large DNN models.

Specifically, students must implement the techniques in [1] in TensorFlow and investigate their impact on the quality of DNNs and the overhead in deployment (using TensorFlow Serving [3, 4] or TFX [2, 5, 6]) when compared to the full retraining approach.

 

Prerequisites:

  • Familiarity with stochastic gradient descent optimization in ML and the mathematical concepts behind it
  • Basic familiarity with neural networks and training approaches for them
  • Experience with Python and TensorFlow

 

References:

[1] Derakhshan, Behrouz, et al. "Continuous Deployment of Machine Learning Pipelines." EDBT. 2019.

[2] Baylor, Denis, et al. "TFX: A TensorFlow-based production-scale machine learning platform." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.

[3] Olston, Christopher, et al. "TensorFlow-Serving: Flexible, high-performance ML serving." arXiv preprint arXiv:1712.06139 (2017).

[4] github.com/tensorflow/serving [8]

[5] www.tensorflow.org/tfx [9]

[6] github.com/tensorflow/tfx [10]

Gabor Gévay [11]

gevay@tu-berlin.de

Research Area: „Programming Models in Data Analytics“

Topic Area: Compiling Datalog Programs to Apache Flink

Datalog is a highly declarative database query language, allowing users to express queries in a concise way and leaving the details of execution to the system. There are several works [1,2,3,4,5,6,7] aiming to execute Datalog queries on large amounts of data by using distributed dataflow systems such as Apache Spark. Building on this line of work, the aim of this thesis is to execute Datalog queries on Apache Flink. Flink is a state-of-the-art dataflow system supporting cyclic dataflows [8], which makes iterative computations more efficient. Therefore, Flink would be well-suited to implement the so-called semi-naive bottom-up strategy [9] of executing Datalog queries.

The closest work and state of the art is probably [5]. The Flink implementation could have a better performance, and it could even be simpler by relying on Flink’s incremental iterations. Several complications that arise in [5], such as dealing with the overhead of launching many dataflow jobs, are automatically solved by Flink’s cyclic dataflows, whereby the entire program is executed as a single (cyclic) dataflow job.
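
As a rough illustration of the semi-naive bottom-up strategy mentioned above, the following plain-Python sketch evaluates the classic transitive-closure program. It is not Flink code, and the names `transitive_closure`, `edge`, and `path` are hypothetical. The point is that each round joins only the newly derived facts (the delta) with the edge relation, which is the kind of loop an incremental iteration in a dataflow system would presumably express.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def transitive_closure(edges: Set[Edge]) -> Set[Edge]:
    """Semi-naive evaluation of the Datalog program

        path(X, Y) :- edge(X, Y).
        path(X, Z) :- path(X, Y), edge(Y, Z).
    """
    # Index the edge relation by source node so the join is a lookup.
    successors: Dict[str, Set[str]] = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    path = set(edges)   # all facts derived so far
    delta = set(edges)  # facts that are new in the latest round
    while delta:
        # Join only the new facts with edge, not the whole path relation.
        derived = {(x, z) for (x, y) in delta for z in successors.get(y, ())}
        delta = derived - path
        path |= delta
    return path


if __name__ == "__main__":
    print(sorted(transitive_closure({("a", "b"), ("b", "c")})))
    # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

The loop terminates once a round derives no new facts, which also covers cyclic inputs.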

 

[1] Afrati, F. N., Borkar, V., Carey, M., Polyzotis, N., & Ullman, J. D. (2011, March). Map-reduce extensions and recursive queries. In Proceedings of the 14th international conference on extending database technology (pp. 1-8). ACM.

[2] Shaw, M., Koutris, P., Howe, B., & Suciu, D. (2012, September). Optimizing large-scale Semi-Naïve datalog evaluation in hadoop. In International Datalog 2.0 Workshop (pp. 165-176). Springer, Berlin, Heidelberg.

[3] Bu, Y., Borkar, V., Carey, M. J., Rosen, J., Polyzotis, N., Condie, T., ... & Ramakrishnan, R. (2012). Scaling datalog for machine learning on big data. arXiv preprint arXiv:1203.0160.

[4] Wang, J., Balazinska, M., & Halperin, D. (2015). Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines. Proceedings of the VLDB Endowment, 8(12), 1542-1553.

[5] Shkapsky, A., Yang, M., Interlandi, M., Chiu, H., Condie, T., & Zaniolo, C. (2016, June). Big data analytics with datalog queries on Spark. In Proceedings of the 2016 International Conference on Management of Data (pp. 1135-1149). ACM.

[6] Wu, H., Liu, J., Wang, T., Ye, D., Wei, J., & Zhong, H. (2016, November). Parallel materialization of datalog programs with Spark for scalable reasoning. In International Conference on Web Information Systems Engineering (pp. 363-379). Springer, Cham.

[7] Rogala, M., Hidders, J., & Sroka, J. (2016, June). DatalogRA: datalog with recursive aggregation in the Spark RDD model. In Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems (p. 3). ACM.

[8] Ewen, S., Tzoumas, K., Kaufmann, M., & Markl, V. (2012). Spinning fast iterative data flows. Proceedings of the VLDB Endowment, 5(11), 1268-1279.

[9] Ceri, S., Gottlob, G., & Tanca, L. (1989). What you always wanted to know about Datalog (and never dared to ask). IEEE transactions on knowledge and data engineering, 1(1), 146-166.

Dr. Holmer Hemsen [12]

holmer.hemsen@dfki.de

Research Area: „Scalable Signal Processing / Industrie 4.0“

Topic Areas: Data Analytics of Massive Time Series, Intelligent and Scalable Resource Management for Industrie 4.0

Data Analytics of Massive Time Series
A time series is a set of observations, each recorded at a specific time. Examples of time series are manifold, e.g. electrocardiography curves, stock market data, seismic measurements, and network load. Time series analysis comprises a wide range of methods, such as anomaly and outlier detection, forecasting, and pattern recognition. The focus of this topic area is on research into methods for the analysis of massive and/or multi-dimensional time series data.

Intelligent and Scalable Resource Management for Industrie 4.0
The goal of Industrie 4.0 is to digitalize, automate and optimize industrial production systems. In many cases this involves upgrading of conventional production systems into cyber-physical systems, often by utilizing Internet of Things (IoT) technology. The focus of this research topic is on methods for scalable optimization of production lines and intelligent forecasting of consumable resources to calculate optimal dynamic maintenance strategies.

Prerequisites: Strong programming skills in Java, Scala, or Python; good writing skills; preferably knowledge of Apache Flink

 

 

Dr. Zoi Kaoudi [13]

zoi.kaoudi@tu-berlin.de

Research Area: „Data management for Machine Learning Systems“

Topic Area: Debugging ML Systems

Nowadays many libraries and systems have been developed for building ML models over big data. A key stumbling block in truly leveraging big data analytics insights is the need not only for big data debugging but also for ML debugging. Most existing tools and frameworks have been developed for code-based debugging and are neither flexible nor sufficient for data and ML debugging. We seek to develop a powerful, flexible, and intuitive ML debugging system that handles debugging of an entire ML workflow, from model training and inference to ML diagnostics.

Topic Area: Scalable machine learning systems for streaming graphs

Many real-world applications require machine learning on graphs that are constantly changing (streaming graphs). For instance, in social networks, where new friendships are continuously created, applications need to monitor and predict new connections to improve their recommendation algorithms. In the Internet of Things, where millions of participants change their physical location continuously, applications need to handle highly dynamic behavior. We seek to develop a system that can run analytics and machine learning tasks over streaming graphs.

 

 

Martin Kiefer [14]

martin.kiefer@tu-berlin.de

Research Area: „Approximate Data Analysis Using Modern Hardware“

Topic Areas: Data Stream Summarization Using Custom Hardware (FPGAs), Improving Query Optimization Using Modern Hardware

My research investigates the combination of data approximation techniques and modern hardware architectures to increase the efficiency of data analysis.

 

Data Stream Summarization Using Custom Hardware (FPGAs)


Power-efficient data analysis is an increasingly important problem in the era of big data: the amount of available data continues to increase exponentially, and for economic and environmental reasons, we need to ensure that the energy demands required to analyze the data do not grow exponentially as well. I am approaching this problem for data stream analysis by combining the potential of stream summarization techniques and custom hardware on FPGAs.


Improving Query Optimization Using Modern Hardware


The query optimizer is at the heart of state-of-the-art relational database systems. It derives different execution plans for a given query and selects the cheapest one based on statistics and a cost model. However, this has to be done within a very tight time budget, since query optimization delays query execution. I am investigating how modern hardware can help with this task. In particular, I’m improving the statistics available to the query optimizer by using bandwidth-optimized kernel density models as learning selectivity estimators on GPUs.
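
As a rough illustration of how a kernel density model can serve as a selectivity estimator, the following Python sketch estimates the fraction of rows satisfying a range predicate from a sample of a single column. It is an assumption-laden simplification: it runs on the CPU, uses a Gaussian kernel with a rule-of-thumb bandwidth instead of an optimized one, and all function names are hypothetical. It is not the GPU-based estimator described above.

```python
import math
import statistics
from typing import Sequence


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rule_of_thumb_bandwidth(sample: Sequence[float]) -> float:
    """Normal-reference rule of thumb: 1.06 * sigma * n^(-1/5).

    Placeholder only; a bandwidth-optimized model would tune this value.
    """
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-1.0 / 5.0)


def kde_selectivity(sample: Sequence[float], low: float, high: float,
                    bandwidth: float) -> float:
    """Estimate the selectivity of `low <= x <= high`.

    Each sample point contributes the probability mass that its Gaussian
    kernel places inside the query range.
    """
    total = 0.0
    for x in sample:
        total += (normal_cdf((high - x) / bandwidth)
                  - normal_cdf((low - x) / bandwidth))
    return total / len(sample)


if __name__ == "__main__":
    sample = [float(i) for i in range(100)]  # hypothetical column sample
    h = rule_of_thumb_bandwidth(sample)
    print(kde_selectivity(sample, 25.0, 75.0, h))
```

The per-sample contributions are independent of each other, which is one reason this kind of estimator lends itself to data-parallel hardware.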


Requirements

I offer thesis topics based on current research questions, student interests, and student skills. Students with an interest in modern hardware are preferred, but I may also provide thesis topics without a hardware focus. Skills in C/C++, OpenCL, VHDL, or Python programming might be useful.


If you are interested in my research topics, we can arrange a meeting for a discussion. Please include a CV in your request.

Dr. Ralf-Detlef Kutsche [15]

ralf-detlef.kutsche@tu-berlin.de

Research Area: „Model Based Software and Data Integration“, focussing on „Semantic Concepts / Semantic Interoperability“

Topic Area: Model Based Methods for the Development of Heterogeneous and Distributed Information Systems in Large Scale

Since the 1980s, there has been a huge discussion about the quality of software and information systems. Today this is absolutely a top issue for our modern data world, as we can see from statements like „Data is the gold of modern times“ in the „big data“, „data analytics“, and „data science“ fields. In these fields, Prof. Markl and his groups play a fundamental role in the world with the BBDC (Berlin Big Data Center), with the university spin-off Data Artisans (Apache FLINK), with several international Master’s programmes and tracks like Erasmus Mundus IT4BI and BDMA, the EIT Digital ICT-Innovation Master track „Data Science“, and the TUB local Master track „Data Analytics“, and with many other activities.

Unfortunately, two Gardner studies of 1985 and 2005 show the same disastrous result: Only approx. 25% of the software projects started come to a successful end in time and in budget, another 25% only after some time delay, budget overruns, and maybe even reduced functionality and performance. The remaining 50% die on the way, or are never even started properly after some initial planning!

Model based methods (in earlier times following the MDA (Model Driven Architecture) ideas of the OMG (Object Management Group), an international standardisation and management consortium of almost all active large companies in the world) have for many years promised both to improve the quality and to reduce the cost of software dramatically (by up to 70%) through models applied to all (!) phases of the whole software process, in our case for the development of (potentially, but in most cases) Heterogeneous and Distributed Information Systems (HDIS) in large scale!

Applying models (e.g. UML models in simple cases, but better: ‚domain specific modeling languages‘) and semantic concepts (e.g. ontologies, formal logic and semantics, metadata standards, and (meta-)thesauri) can support these methods significantly, as the results of two very large industrial R&D projects (among many others) under my scientific guidance show (e.g. BiZYCLE, 2007-2010, and BIZWARE, 2010-2013, funded by the German ministry of research BMBF).

Candidates interested in these topics should have an excellent background in databases and information systems, in software engineering and software architecture, in formal methods and mathematics or theoretical computer science, particularly logic formalisms and languages, and, of course, in modeling with classical modeling languages in the UML family, with E/R (known anyway from every DBS course), and with BPMN or any other process/workflow modeling language. They should also be interested in application domains like health care (my main application area for 30 years), the automotive industry, business intelligence, and, very relevant for the future, the energy sector!

If you fulfill these requirements, and you participated in my classes or can prove your knowledge gained at other universities in these fields, please apply for a thesis by email at ralf-detlef.kutsche@tu-berlin.de [16] or in my office hours (Tue, 12-13, during semester time, by appointment).

Clemens Lutz [17]

clemens.lutz@dfki.de

Research Area: „Fast & Scalable Data Processing on Modern Hardware“

Topic Area: Fast & Scalable Data Processing on Modern Hardware

GPUs and other modern processors are capable of very fast data processing. For example, a high-end Nvidia GPU is capable of reading at 900 GB/s from its built-in memory, and can compute 14 trillion floating-point operations per second (i.e., 14 TFLOPS). This is more than 20 times faster than regular CPUs.

Our aim is to use this processing power to analyze data. This goal opens the door to many research challenges. For example, large data sets do not fit into the GPU’s on-board memory. How can we efficiently access data from a GPU? How can we make a SQL JOIN operator run fast?

Contact me via email if you like writing fast code and are interested in programming GPUs.

Theses are available for Bachelor’s and Master’s students. Include a short text about your skills and research interests, and attach your CV. I encourage you to propose your own thesis topic.

 

 

Dr. Alireza Mahdiraji [18]

alireza.rm@dfki.de

Research Area: „Approximate Query Processing on Data Streams“

Topic Area: Distributed Summarization Data Structures

Many real-world applications (e.g., traffic monitoring, cluster health monitoring, web log analysis, online services) generate data streams at unprecedented rate and volume. Traditional query processing over such massive amounts of streaming data often results in high latency and increased computational cost. This overhead is even more pronounced for query processing over distributed data streams. Approximate Query Processing (AQP), on the other hand, provides approximate answers to queries at a fraction of the cost of the original queries and is a means of achieving interactive response times (sub-second latencies) when faced with voluminous data. Interactive query response times (at the cost of accuracy) are useful for many tasks like exploratory analytics, big-data visualization, or trend analysis. In particular, AQP techniques utilize data synopses (or summaries), much smaller representations of the data that are used to quickly answer queries at the cost of accuracy. Examples of such synopses are samples, histograms, wavelets, and sketches.
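
To make the notion of a sketch synopsis concrete, here is a minimal Python sketch of a Count-Min sketch, one well-known summary for approximate frequency counts over a stream. The class layout, the parameters `width` and `depth`, and the choice of BLAKE2b as hash function are assumptions made for this illustration, not a description of the group's methods. The `merge` method hints at why such synopses suit distributed streams: sketches built at different nodes can be combined cell by cell.

```python
import hashlib
from typing import Hashable, List


class CountMinSketch:
    """Fixed-size summary that may overestimate but never underestimates."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: Hashable, row: int) -> int:
        # One salted hash per row keeps the row hash functions distinct.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: Hashable, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: Hashable) -> int:
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

    def merge(self, other: "CountMinSketch") -> None:
        """Combine a sketch built elsewhere with the same dimensions."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions must match")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


if __name__ == "__main__":
    node_a, node_b = CountMinSketch(64, 4), CountMinSketch(64, 4)
    for url in ["/home", "/home", "/cart"]:
        node_a.add(url)
    node_b.add("/home")
    node_a.merge(node_b)
    print(node_a.estimate("/home"))  # at least 3
```

Memory use is `width * depth` counters regardless of how many items the stream contains; larger values reduce the overestimation at the cost of space.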

Our research focuses on developing methods for efficient construction and maintenance of data synopses for large amounts of streaming data that is generated in a distributed fashion.

Diogo Telmo Neves [19]

 diogo_telmo.neves@dfki.de

Research Area: „Data Management“

Topic Area: Synthetic Data

Title: Synthetic Data Generator from Datasets, Structural and Statistical Metadata, and Domain Knowledge.


Description: Due to legal, ethical, privacy, or other issues, getting access to datasets is often extremely difficult, if not impossible. This is a very common bottleneck that impairs data analytics in many domains (e.g. the healthcare domain). The aim of this research project is to develop a synthetic data generator. The generation of synthetic data should be guided by metadata as well as by domain knowledge and not just by statistical methods and metrics that can be derived from raw data. To be flexible enough, the generator should have a set of parameters and hyperparameters that would allow, for instance, the generation of data points that would be classified as anomalies or outliers.


Expected Outcome: A (Python) library that generates synthetic data from datasets, structural and statistical metadata, and domain knowledge.


Requisites:
• Strong skills in using libraries such as scikit-learn and Pandas.
• Strong knowledge of procedural, functional, and object-oriented programming paradigms.
• Strong programming skills in at least one of the following programming languages: Python, Java, and C++.
• Strong knowledge about algorithmic complexity.


Nice to Have:
• Good knowledge of statistics.
• Good knowledge of machine learning algorithms and techniques (e.g. GAN).
• Good knowledge of RDBMS as well as graph databases (e.g. Neo4j).
• Good knowledge of concurrent, parallel, and distributed programming.


If this topic sounds interesting to you, please send me an email with a few lines about yourself, your background, your research interests, and your programming skills, and attach your CV.

Topic Area: Data Quality

Title: Automated Improvement of Data Quality by Incorporating Structural and Statistical Metadata, and Domain Knowledge.


Description: Datasets often lack (data) quality, which impairs the accuracy of machine learning algorithms and makes it harder, if not impossible, to implement a prediction model that generalizes well to unseen data. Thus, poor data quality has more far-reaching consequences than a data scientist can initially foresee. The aim of this research project is to investigate whether it is possible to automate the improvement of dataset quality by incorporating structural and statistical metadata and domain knowledge as a means to detect and report data issues, then apply data transformations that automatically fix the detected issues, and maintain a data lineage that makes it possible to understand and explain every data transformation applied to the original dataset.


Expected Outcome: A (Python) library that automates the improvement of data quality.


Requisites:
• Strong skills in using libraries such as scikit-learn and Pandas.
• Strong knowledge of procedural, functional, and object-oriented programming paradigms.
• Strong programming skills in at least one of the following programming languages: Python, Java, and C++.
• Strong knowledge about algorithmic complexity.


Nice to Have:
• Good knowledge of statistics.
• Good knowledge of RDBMS as well as graph databases (e.g. Neo4j).
• Good knowledge of concurrent, parallel, and distributed programming.
• Good knowledge of at least one of the following processing engines: Apache Flink, Apache Spark, and Apache Kafka.


If this topic sounds interesting to you, please send me an email with a few lines about yourself, your background, your research interests, and your programming skills, and attach your CV.

Alexander Renz-Wieland [20]

alexander.renz-wieland@tu-berlin.de

Research Area: „Large-Scale Machine Learning“

Topic Area: Large-Scale Machine Learning

Training machine learning (ML) models on a cluster instead of a single machine increases the amount of available compute and memory, but requires communication among cluster nodes for synchronizing model parameters. For some ML models, this synchronization can become the dominating part of the training process, such that using more computers does not result in the intended speed-up.

To avoid much of this communication, researchers developed algorithms that create and exploit parameter locality. That is, at a given point in time each of the workers updates only a subset of the model parameters. These subsets typically change throughout the training process, i.e., workers update different subsets throughout training. Such algorithms exist for multiple types of ML models. The locality can stem from the training algorithm, the ML model, or the training data.

ML developers typically need to implement such locality-exploiting algorithms from scratch, i.e., they have to know about low-level details of distributed computing. We are developing a system that allows researchers and practitioners to implement such algorithms without detailed knowledge of distributed computing. Our approach is to make the state-of-the-art architecture for distributed ML, so-called parameter servers, usable and efficient for locality-exploiting algorithms.  

I offer multiple thesis topics related to this line of work. For example, theses can work on aspects of the system or apply the system to specific ML models.

If you are interested, don't hesitate to contact me to arrange a meeting. Please provide me with some information about you, your interests, your prior experience, your programming skills, and your CV.

Ideally, you have coursework or other prior experience in machine learning, C++ programming, and/or distributed systems.

Viktor Rosenfeld [21]

viktor.rosenfeld@dfki.de

Research Area: „Adapting Data Processing Code to Different Processor Types Without Manual Tuning“

Topic Area: Data Processing on Heterogeneous Processors

In the last decade, processors have become increasingly diverse, parallelized, and specialized for specific tasks. For example, in addition to multi-core CPUs, there are GPUs, Intel Xeon Phis, and FPGAs. Often, developers have to write program code that is specific to a particular processor to fully exploit its resources.

In my research, I study how a database system can automatically adapt its operator code to the processor that it is currently running on. In essence, my goal is to write a database system that learns how to rewrite itself until it runs as fast as possible on any given processor. To this end, I work a lot with OpenCL, which is a programming standard that enables users to run the same program on different types of processors such as CPUs, GPUs, etc.

Prerequisites: Strong programming skills in C/C++, interest in low-level programming, interest in processor architecture.
Nice to have: Knowledge of GPU programming (e.g., OpenCL and/or CUDA); interest in automatic tuning.

I offer to mentor both bachelor's and master's theses in the context of data processing on heterogeneous processors. I encourage students to develop their own ideas. The proposals below can be used as a starting point.

Please contact me via email to discuss your ideas for a thesis. Be sure to include a short text about your skills and interests, and attach your CV.

 

Detailed topic proposals:

Evaluation of Hash-Based Grouped Aggregation Algorithms on GPUs

Hash-based grouped aggregation has been studied extensively on multi-core CPUs. In general, one of three algorithms works best, depending on the group cardinality. However, as the cardinality is not always known in advance, there are also algorithms that do not assume prior knowledge and degrade gracefully as the cardinality grows.

The goal of this work is to port and adapt these algorithms, which are written for multi-core CPUs, to GPUs using OpenCL as a target language. The algorithms should be integrated into an existing test suite. A thorough evaluation of these algorithms and a comparison with existing algorithms is also part of this thesis.
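For readers unfamiliar with the operator, here is a minimal single-threaded Python sketch of hash-based grouped aggregation, roughly what `SELECT key, SUM(value) ... GROUP BY key` computes. It only illustrates the basic idea and is not one of the CPU or GPU algorithms to be evaluated in this thesis; the function name `hash_group_sum` is hypothetical.

```python
def hash_group_sum(rows):
    """Aggregate (key, value) pairs into per-key sums using a hash table."""
    table = {}  # Python's dict is a hash table: group key -> running sum
    for key, value in rows:
        # Look up the group's running sum (0 if the group is new) and update it.
        table[key] = table.get(key, 0) + value
    return table


rows = [("a", 1), ("b", 2), ("a", 3)]
print(hash_group_sum(rows))
```

The table holds one entry per distinct group, so its size grows with the group cardinality. This is one reason why the cardinality influences which algorithm performs best.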

 

Dr. Jorge Arnulfo Quiane Ruiz [22]

jorge.quiane@tu-berlin.de

Research Area: „Scalable Data Management“

Topic Area: Big Data Processing

More and more applications must extract useful information from complex data, i.e., from large volumes of data produced at high velocity and in great variety. Today, it is not just important but crucial to use scalable big data processing tools and techniques to cope with the needs of current applications. Have you ever wondered about any of the following questions? How do databases cope with complex data? How do dataflow systems (e.g., Flink or Spark) work? How can one make effective use of existing big data systems? How can big data help a company or organization? How can big data help machine learning? What will big data look like in the future? If so, you may find your topic with us.

Topic Area: Data-Related Ecosystem

Today, data intelligence is monopolized by a small number of companies. This is mainly for two reasons: they own both large amounts of data and big data (including AI) technologies. It is therefore vital to provide new ways (an ecosystem) for sharing data and big data technologies so that the entire society can benefit from this new era of data intelligence. Building such an ecosystem is quite challenging, as we do not have the right data infrastructure. If you want to be part of this adventure, come and join us in our Agora project, where we aim at devising the data infrastructure that will make such a data-related ecosystem possible.

Topic Area: Data Debugging

How many times have you heard that big data (bd), data science (ds), or machine learning (ml) is helping scientists make great advances in different domains? But have you ever heard how hard it is to get those bd, ds, or ml pipelines polished enough to be applied in practice? Yes, debugging your bd, ds, and ml pipelines is something of a taboo: nobody talks about it, but we all suffer from it! Data debugging has come to break this taboo. We are developing a general-purpose data debugging system that provides interactive techniques for debugging bd, ds, and ml pipelines.

 

 

Juan Soto [23]

juan.soto@tu-berlin.de

Research Area: „Data Analysis / Data Analytics“

Topic Area: Exploratory Data Analysis, Numerics in Data Analytics

Exploratory Data Analysis

An Analysis of Current Approaches/Solutions for Big Data Problems and Devising Novel Techniques

Numerics in Data Analytics

A Closer Look at Software Quality in Existing Big Data Analytics Libraries: Challenges and Pitfalls.

Dr. Eleni Tzirita Zacharatou [24]

eleni.tziritazacharatou@tu-berlin.de

Research Area: „Data Management for the Internet of Moving Things“

Topic Area: Data Management for the Internet of Moving Things

The Internet of Things (IoT) is a system of interconnected devices that exchange data over a network without human intervention. Some of the most important pieces in the IoT landscape are devices that can move (i.e., continuously change their geo-location over time), such as smartphones and tablets. Not only are mobile devices ubiquitous, but their computing capabilities are also ever-increasing. Consequently, they are suitable for performing data processing tasks, thereby reducing data communication. However, executing queries over distributed mobile resources is challenging, mainly due to the dynamically changing connectivity among mobile nodes. In my research, I investigate the impact of mobility on an IoT data management system that answers thousands of queries over millions of devices. Furthermore, I explore methods and algorithms to mitigate the effects of mobility and enable robust query execution.

If you are interested, don't hesitate to contact me to arrange a meeting. Please make sure to provide me with some information about you, your interests, your prior experience, your programming skills, and your CV.

Ideally, you have coursework or other prior experience in C++ programming and distributed systems (stream processing engines, databases).

Dr. Steffen Zeuch [25]

steffen.zeuch@dfki.de

Research Area: „Query Optimization and Execution on Modern CPUs“

Topic Area: Query Optimization and Execution on Modern CPUs

Over the last decades, database systems have been migrated from disk to memory architectures such as RAM, Flash, or NVRAM. Research has shown that this migration fundamentally shifts the performance bottleneck upwards in the memory hierarchy. Whereas disk-based database systems were largely dominated by disk bandwidth and latency, in-memory database systems mainly depend on the efficiency of faster memory components, e.g., RAM, caches, and registers.

With respect to hardware, the clock speed per core reached a plateau due to physical limitations. This limit caused hardware architects to devote an increasing number of available on-chip transistors to more processors and larger caches. However, memory access latency improved much more slowly than memory bandwidth. Nowadays, CPUs process data much faster than data can be transferred from main memory into caches. This trend creates the so-called Memory Wall, which is the main challenge for modern main memory database systems.

To counter these challenges and unlock the full potential of the available processing power of modern CPUs for database systems, we propose theses that aim to reduce the impact of the Memory Wall.

We also encourage students to propose their own topics in the field of query optimization and processing on modern CPUs.

Requirements: Strong programming skills in C/C++, deep knowledge of database implementation techniques, good understanding of computer architecture
Optional: Knowledge of LLVM, VTune, MPI, OpenMP


Dr. Shuhao Zhang (Tony) [26]

Shuhao.zhang@tu-berlin.de

Research Area: "Data Stream Processing and Management"

Topic Area: Data Stream Processing and Management

General Description: 

Data stream processing is a hot topic. It is a technique that allows users to process unbounded data streams in real time. Nowadays, it is a buzzword that appears almost everywhere, driven by the Internet of Things, 5G, real-time AI, etc.

To cope with the fast-growing application demand, many stream processing systems have been proposed in the last decades, such as Storm, Flink, and Spark Streaming.

 

My research generally involves two aspects: 1) architecting novel stream processing systems and algorithms and 2) designing novel stream processing applications (e.g., online machine learning and trajectory analytics).

 

If you are interested in a topic in this area, please provide me with some information about you, your interests, your programming skills, and your CV.

 

You may also propose your own topics in the field of Data Stream Processing and Management.

 

Prerequisites: Strong programming skills in C/C++/Java/Python

Nice to have: Knowledge of databases, stream processing systems, data mining, machine learning, or time series and trajectory analytics

 

Detailed available topics:

1. Transactional Stream Processing System

In this topic, you will work on TStream [1], a recently proposed stream processing system supporting concurrent stateful stream processing. You can choose to enhance the system in one of several respects: 1) efficient support for transaction aborts; 2) a more scalable indexing technique; or 3) a more efficient way of handling data dependencies among state transactions.

2. Secure and Private Stream Processing

In this topic, you will work with Intel SGX [2] and/or homomorphic encryption (HE) [3] to support secure and private stream processing. The goal is to investigate how to efficiently implement some of the important stream processing operations (e.g., stream join, window aggregation) in Intel SGX and/or with an HE scheme.

3. Progressive and Interactive Applications

In this topic, you will implement novel stream applications on one of the existing stream processing systems, such as Storm, Flink, or Spark Streaming. The goal is to investigate how to better support applications such as online machine learning/data mining [4] and fast trajectory exploration [5].
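To illustrate one of the stream operations named in topic 2, the following Python sketch computes a window aggregation over fixed-size, non-overlapping (tumbling) windows. It is a plain-Python illustration under simplifying assumptions (events arrive ordered by integer timestamp, no late data), and the function name `tumbling_window_sum` is hypothetical. It does not use the API of Storm, Flink, Spark Streaming, SGX, or any HE library.

```python
def tumbling_window_sum(events, window_size):
    """Sum values per fixed-size, non-overlapping time window.

    events: iterable of (timestamp, value) pairs, assumed ordered by timestamp.
    Yields (window_start, total) each time a window closes.
    """
    current_start = None
    total = 0
    for timestamp, value in events:
        # Map the event to the start of the window that contains it.
        start = (timestamp // window_size) * window_size
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The event belongs to a later window: emit the finished one.
            yield current_start, total
            current_start, total = start, 0
        total += value
    if current_start is not None:
        # Flush the last window; only reached if the input is finite.
        yield current_start, total


events = [(0, 1), (3, 2), (5, 4), (11, 1)]
print(list(tumbling_window_sum(events, 5)))
```

Because the function is a generator, results are emitted incrementally as windows close, which mirrors how a stream operator produces output without waiting for the end of the input.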

[1] Zhang et al. Towards Concurrent Stateful Stream Processing on Multicore Processors, ICDE’20

[2] Segarra et al. Using Trusted Execution Environments for Secure Stream Processing of Medical Data, Distributed Applications and Interoperable Systems, 2019

[3] Burkhalter et al. TimeCrypt: Encrypted Data Stream Processing at Scale with Cryptographic Access Control, NSDI’20

[4] Fedoryszak et al. Real-time Event Detection on Social Data Streams, KDD’19

[5] Ang et al. TraV: An Interactive Exploration System for Massive Trajectory Data, IEEE BIGMM’19

 

 

Copyright TU Berlin 2008