Student: Jonas Traub
Advisors: Dr. Asterios Katsifodimos, Paris Carbone (KTH, Sweden) Prof. Dr. Volker Markl, Seif Haridi (KTH, Sweden)
Stream processing differs significantly from batch processing: Queries are long running, continuously consuming data from input streams and, in turn, producing output streams. A stream can be defined as a data source that infinitely produces/emits data-items. Hence, it is impossible to store all the history of emitted items in order to query them. In order to produce answers to a given query over an infinite set of items, a stream is split (or, discretized) into smaller subsets of data-items, called windows. Consecutively, an operation such as a Join, Aggregation or a Reduce operator is applied on the discretized window.
In this thesis, we designed and implemented highly expressive means of window discretisation in the Apache Flink stream processing engine. The rules for the discretization of a stream are called windowing policies. Our discretization operators are event-driven, and triggered by data-item arrivals. When a data-item arrives, trigger and eviction policies are notified. A trigger policy specifies, when a reduce function is executed on the current buffer content. An eviction policy specifies when data-items are removed from the buffer.
This thesis is the first to our knowledge to propose the use of windowing policies in the form of UDF in a data parallel execution engine. Expressing windowing policies as UDF results in very high expressivity and flexibility. Thus, very complex queries can be defined, going much beyond the predefined count-, time-, punctuation-, and delta-based windows. This thesis is also the first to implement a streaming API that allows user-defined trigger and eviction policies in a data-parallel stream processing engine.