(MSc) Glanceable Visualization of Dataflows
- © Marcus Leich
In-situ data processing is currently all the rage. You do no ingestion, no schema transformation, no cleansing, nothing whatsoever that makes your life easier, just to get your results fast. This approach leads to the following problem: you don’t get to know your data before you process it. Instead, you get to know your data when it breaks your code.
Default statistics such as min, max, avg, median, and histograms may help for simple numeric data. However, text data and (semi-)structured data in particular call for different approaches. Aside from knowing what your raw data looks like at the input stage, in data flow programs it is also crucial to understand how the different operations affect the data within the data flow.
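To make the contrast between numeric and text data concrete, here is a minimal sketch in plain Java. The class name `ColumnProfile` and all of its methods are hypothetical and exist only for illustration; they are not part of any existing tool. The "shape" profile for text (letters become `a`, digits become `9`) is just one possible approach among many, and the sketch assumes the text column contains no null values.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only.
public class ColumnProfile {

    /** Count, min, max, and average of a numeric column. */
    static DoubleSummaryStatistics summarize(double[] values) {
        return Arrays.stream(values).summaryStatistics();
    }

    /** Median of a numeric column; the input array is not modified. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("empty column");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Maps a string to its shape: letters become 'a', digits become '9', all other characters are kept. */
    static String shape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) {
                sb.append('a');
            } else if (Character.isDigit(c)) {
                sb.append('9');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Counts how often each shape occurs in a text column (assumes no null entries). */
    static Map<String, Integer> shapeHistogram(List<String> column) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : column) {
            counts.merge(shape(value), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] prices = {3.0, 1.0, 4.0, 1.0, 5.0};
        DoubleSummaryStatistics stats = summarize(prices);
        System.out.println("min=" + stats.getMin() + " max=" + stats.getMax()
                + " avg=" + stats.getAverage() + " median=" + median(prices));

        // Made-up sample values: two date formats and a placeholder.
        List<String> dates = Arrays.asList("2014-12-01", "2014-12-02", "01.12.2014", "n/a");
        System.out.println(shapeHistogram(dates));
    }
}
```

For the made-up date column, the histogram contains three shapes (`9999-99-99` twice, `99.99.9999` once, `a/a` once), which exposes the mixed formats and the placeholder value that min, max, or avg could never reveal.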
It is the goal of this project to provide developers with a visualization tool that helps them quickly understand what their data looks like (at the source and within the data flow), in order to rule out any false assumptions about the input and its behavior in the data flow as early as possible.
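As a rough illustration of what "within the data flow" could mean, the following sketch counts the records that pass each stage of a small pipeline. It uses plain Java streams as a stand-in for a real dataflow engine; it is not the Hadoop, Spark, or Flink API, and the class `StageCounts`, its `tap` method, and the stage names are all hypothetical. A real tool would collect richer statistics than counts and would have to work on parallel, distributed operators.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical probe for illustration only; not thread-safe, so use it with sequential streams.
public class StageCounts {

    /** Record counts per named stage, in the order the stages were first seen. */
    private final Map<String, Long> counts = new LinkedHashMap<>();

    /** Counts one record at the given stage and passes it through unchanged. */
    <T> T tap(String stage, T record) {
        counts.merge(stage, 1L, Long::sum);
        return record;
    }

    Map<String, Long> counts() {
        return counts;
    }

    /** A toy data flow: parse lines as integers and keep the non-negative ones. */
    static List<Integer> run(List<String> lines, StageCounts probe) {
        return lines.stream()
                .map(l -> probe.tap("source", l))
                .map(String::trim)
                .filter(l -> l.matches("-?\\d+"))   // drop lines that are not integers
                .map(l -> probe.tap("after filter", l))
                .map(Integer::parseInt)
                .filter(n -> n >= 0)                // drop negative values
                .map(n -> probe.tap("after range check", n))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        StageCounts probe = new StageCounts();
        // Made-up sample input.
        List<Integer> result = run(Arrays.asList(" 12", "x7", "-3", "40", ""), probe);
        System.out.println(result);
        System.out.println(probe.counts());
    }
}
```

With the made-up input, five records enter at the source, three survive the integer filter, and two survive the range check. A developer who assumed that every line was a non-negative integer would see the mismatch at a glance.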
You should already have the following skills or be willing to acquire them during your thesis project:
- Java and/or Scala development skills
- statistics for analyzing datasets
- good understanding of visualization techniques
- knowledge of dataflow programming concepts (Hadoop MapReduce, Apache Spark, Apache Flink)
- M. Leich, “Runtime Analysis of Distributed Data Processing Programs,” PhD Workshop at VLDB 2014.
- A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinländer, M. J. Sax, S. Schelter, M. Höger, K. Tzoumas, and D. Warneke, “The Stratosphere platform for big data analytics,” The VLDB Journal — The International Journal on Very Large Data Bases, vol. 23, no. 6, Dec. 2014.