Analysis of Stream Processing Operators

The last decade has seen the emergence of data processing systems that specialize in processing data “in flight”, as opposed to regular database systems, where the data to be processed is mainly stored on disk. This class of systems, i.e., Complex Event Processing (CEP) systems and Data Stream Management Systems (DSMSs) like [1,2,3], receives data as streams from sources (health sensors, web click streams, etc.) and processes the data while it flows through memory. In modern computing architectures, such as cloud and IoT platforms with several heterogeneous devices, we can achieve higher performance by distributing the processing among connected devices. However, this task is complex if we try to achieve the highest possible performance. The reasons are the unique characteristics of the streams (such as their rate), the structure and complexity of the queries (one or more stream operators), and the interplay between the system and the underlying network.

The DMMS research group is currently doing research on the distribution of stream processing, motivated by enabling privacy-preserving and robust real-time health monitoring. A central step towards achieving high performance in this setting is to fully understand the unique characteristics of the different stream operators (JOIN, aggregates like SUM and AVG, machine learning, privacy-enabling operators, etc.). This is highly relevant to current research, because query distribution (operator placement) has received considerable attention in the last few years. However, published works often tend to treat operators more or less equally, and might therefore miss important performance aspects that stem from the specific characteristics of each operator.
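To make the point about differing operator characteristics concrete, the following minimal sketch contrasts a count-based sliding-window AVG with a count-based window JOIN. It is only an illustration under simplifying assumptions: the class names `SlidingAvg` and `WindowJoin` are hypothetical, the join is a plain nested-loop probe, and none of this reflects how Siddhi or any other engine implements these operators.

```python
from collections import deque


class SlidingAvg:
    """Count-based sliding-window AVG (hypothetical sketch).

    Keeps a running sum, so each arriving tuple costs O(1) work.
    """

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def push(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            # Evict the oldest value and adjust the running sum.
            self.total -= self.window.popleft()
        return self.total / len(self.window)


class WindowJoin:
    """Count-based window equi-join of two streams (hypothetical sketch).

    Every arriving tuple is compared with all tuples currently held in
    the other stream's window, so per-tuple work grows with window size.
    """

    def __init__(self, size):
        self.left = deque(maxlen=size)
        self.right = deque(maxlen=size)
        self.comparisons = 0  # counts key comparisons as a crude cost proxy

    def push(self, side, key, payload):
        own, other = (self.left, self.right) if side == "L" else (self.right, self.left)
        matches = []
        for other_key, other_payload in other:
            self.comparisons += 1
            if other_key == key:
                # Always report matches as (left payload, right payload).
                pair = (payload, other_payload) if side == "L" else (other_payload, payload)
                matches.append(pair)
        own.append((key, payload))
        return matches
```

In this sketch the aggregate does constant work per tuple, while the join's work per tuple scales with the amount of state held for the other stream. This is one simple example of why treating all operators equally during placement may hide relevant performance differences.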

 

We would like to offer potentially multiple master's theses that use our testbed to carry out extensive distributed CEP performance analysis, in order to understand the impact of stream operators, network configurations, and stream sources on performance. One starting point could be our container-based distributed CEP testbed, in which we have already set up an open-source CEP engine called Siddhi [3] on top of Docker containers and ns-3 for network emulation [4], together with our own platform for query distribution. Another starting point could be to focus more on operator performance on real-world devices like the Raspberry Pi.
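As a rough illustration of how stream sources, operators, and the network interact, the back-of-the-envelope sketch below estimates the traffic on a single source-to-sink link depending on where one operator is placed. The function name `link_load`, its parameters, and the example numbers are all hypothetical; they are not part of the testbed and ignore CPU cost, latency, and multi-hop topologies.

```python
def link_load(rate_tps, tuple_bytes, selectivity, place_at_source):
    """Estimate bytes per second crossing the source-to-sink link.

    rate_tps        -- input rate of the source in tuples per second
    tuple_bytes     -- size of one tuple in bytes (assumed equal for input and output)
    selectivity     -- output tuples produced per input tuple by the operator
    place_at_source -- True if the operator runs on the source node,
                       False if it runs on the sink node
    """
    # If the operator runs at the source, only its output crosses the link;
    # otherwise the raw input stream has to be shipped to the sink.
    crossing_rate = rate_tps * selectivity if place_at_source else rate_tps
    return crossing_rate * tuple_bytes


# Hypothetical numbers: a filter-like operator (selectivity < 1) versus
# a join-like operator that emits more tuples than it receives.
filter_at_source = link_load(1000, 100, 0.25, place_at_source=True)
filter_at_sink = link_load(1000, 100, 0.25, place_at_source=False)
join_at_source = link_load(1000, 100, 2.0, place_at_source=True)
join_at_sink = link_load(1000, 100, 2.0, place_at_source=False)
```

Under these assumptions, the filter-like operator reduces link traffic when placed at the source, while the join-like operator increases it, so the preferred placement depends on the operator. A performance analysis on the testbed would replace such a toy model with measurements.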

 

Thesis Details

The thesis will be given as a long thesis, which means 60 credits. It will include surveying state-of-the-art work on distributed processing / DCEP and operator placement, as well as programming work on the testbed, and extensive analysis and performance evaluation.

The master's student should be experienced in programming (Java, C, and C++) and have good knowledge of networks and databases. The master's courses INF5100, INF5090, and INF31/4150 are highly relevant and should be taken as part of the master's study.

 

Relevant links / papers:

[1] Esper - http://www.espertech.com/esper/

[2] Apache Flink - https://flink.apache.org

[3] Siddhi CEP - https://github.com/wso2/siddhi

[4] NS-3 Network Simulator - https://www.nsnam.org

[5] F. Starks, S. Kristiansen, and T. Plagemann. DCEP-Sim: An open simulation framework for distributed CEP: Introduction for users and prospective developers. In DEBS 2018.

Keywords: CEP, IoT, mHealth
Published Aug. 12, 2019 14:19 - Last modified Aug. 12, 2019 14:19

Scope (credits)

60