TU Berlin

Overview

Author: Till Rohrmann

Supervisor: Sebastian Schelter

Intended degree: Master

Abstract

In recent years, the volume of collected data, and with it its significance for society and industry, has grown rapidly. Data sets often exceed the main memory of a single computer, which poses a serious problem for current analytics tools such as MATLAB and R. Handling data at this scale requires exploiting parallelism.

In this talk, we present Gilbert, a sparse linear algebra environment that addresses this shortage of analytic capacity. Gilbert offers a MATLAB-like programming language for linear algebra programs, which are automatically executed in parallel on massively parallel dataflow systems. It thereby frees the user from the tedious and error-prone task of writing parallel code. To achieve this, Gilbert compiles MATLAB code into an intermediate representation. This language-independent representation enables high-level linear algebra optimizations. The optimized program is then translated into an execution plan that can run on Apache Spark and Stratosphere/Apache Flink. We will discuss in detail how linear algebra operations are mapped to these systems and how distributed matrices are represented.
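To make the idea of a distributed matrix more concrete, the following is a minimal single-process sketch of one common representation: the matrix is cut into square blocks, each keyed by its block coordinates, and multiplication becomes a join of blocks on the shared index followed by a grouped sum. This is an illustration only and not Gilbert's actual implementation. The function names (`to_blocks`, `multiply`, `to_dense`) and the dictionary-based block layout are assumptions made for this example, and plain Python loops stand in for the join and reduce operators that a dataflow system would run in parallel.

```python
from collections import defaultdict


def to_blocks(dense, block_size):
    """Split a dense list-of-lists matrix into sparse blocks.

    Result: {(block_row, block_col): {(row_in_block, col_in_block): value}}.
    Zero entries are not stored.
    """
    blocks = defaultdict(dict)
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:
                key = (i // block_size, j // block_size)
                blocks[key][(i % block_size, j % block_size)] = value
    return dict(blocks)


def multiply_block(a, b):
    """Multiply two sparse blocks stored as {(row, col): value}."""
    b_by_row = defaultdict(list)
    for (k, j), value in b.items():
        b_by_row[k].append((j, value))
    out = defaultdict(float)
    for (i, k), a_value in a.items():
        for j, b_value in b_by_row.get(k, ()):
            out[(i, j)] += a_value * b_value
    return dict(out)


def multiply(a_blocks, b_blocks):
    """Block matrix product expressed as a join followed by a grouped sum."""
    # "Join" step: index B's blocks by their block row k so that
    # A(i, k) can be paired with every B(k, j).
    b_by_k = defaultdict(list)
    for (k, j), block in b_blocks.items():
        b_by_k[k].append((j, block))

    # "Group-reduce" step: add up partial products per target block (i, j).
    result = defaultdict(lambda: defaultdict(float))
    for (i, k), a_block in a_blocks.items():
        for j, b_block in b_by_k.get(k, ()):
            for pos, value in multiply_block(a_block, b_block).items():
                result[(i, j)][pos] += value
    return {key: dict(block) for key, block in result.items()}


def to_dense(blocks, rows, cols, block_size):
    """Reassemble a blocked matrix into a dense list of lists."""
    dense = [[0.0] * cols for _ in range(rows)]
    for (bi, bj), block in blocks.items():
        for (i, j), value in block.items():
            dense[bi * block_size + i][bj * block_size + j] = value
    return dense


if __name__ == "__main__":
    A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
    B = [[1, 2, 0], [0, 1, 0], [3, 0, 1]]
    C = multiply(to_blocks(A, 2), to_blocks(B, 2))
    print(to_dense(C, 3, 3, 2))
```

Keying blocks by their coordinates means each partial product depends on only two blocks, so the work can be spread across machines without any single node holding a full matrix.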

Extensive testing indicates that Gilbert scales well to data sizes far exceeding the memory of a single machine. We implemented PageRank, k-means, and Gaussian non-negative matrix factorization in Gilbert. We compare these iterative algorithms with specialized implementations in terms of execution time and implementation effort, and we explain why Gilbert falls short of the optimized versions in performance. Nevertheless, we believe that the gain in productivity compensates for this loss.
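As an illustration of why such algorithms fit a linear algebra language, PageRank can be written as the repeated update r ← d·M·r + (1 − d)/n, where M is the column-normalized link matrix, d the damping factor, and n the number of pages. The sketch below spells out that update in plain Python over a sparse edge list. It is not the Gilbert program from the thesis. The function name `pagerank`, the default damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without outgoing links are assumptions chosen for this example.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Power iteration for PageRank on a directed graph.

    edges: list of (source, target) pairs with node ids in range(num_nodes).
    Each pass of the loop is one sparse matrix-vector product plus a
    constant teleport term.
    """
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly
        # (an assumption of this sketch) so that the ranks keep summing to 1.
        dangling = sum(rank[n] for n in range(num_nodes) if out_degree[n] == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        # Sparse product M * rank: every edge forwards an equal share
        # of its source's rank to its target.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-page graph: pages 0 and 1 link to 2, and 2 links to 0.
    print(pagerank([(0, 2), (1, 2), (2, 0)], num_nodes=3))
```

In a MATLAB-like language the body of the loop reduces to a single matrix-vector expression, which is the kind of implementation effort saving the comparison above refers to.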
