BIFOLD Research Colloquium "State Management in Cloud-Native Streaming Systems" by Yingjun Wu (Singularity Data)
Title: State Management in Cloud-Native Streaming Systems
Venue: Virtual event
Registration: If you are interested in participating, please contact email@example.com.
Abstract: Streaming systems are becoming increasingly essential for extracting business value from data in real time. To meet the different SLAs that customers demand under constantly changing workloads, these systems must take advantage of the scalable, resilient, and diverse resources available in the cloud. This new demand opens new opportunities and challenges for state management, which is at the core of streaming databases. Existing approaches typically use embedded key-value storage so that each worker can access its state locally with high performance. However, this design requires an external durable file system for checkpointing, makes redistributing state during scaling and migration complicated and time-consuming, and is prone to performance throttling. Therefore, we propose shared storage based on an LSM-tree. State is stored in cloud object storage, which makes it durable by design, and the high bandwidth of cloud storage enables fast recovery. The location of a state partition is decoupled from the compute nodes, which makes scaling straightforward and more efficient. Compaction in this shared LSM-tree is globally coordinated with opportunistic serverless boosting instead of relying on individual compute nodes. We design a streaming-aware compaction and caching strategy to achieve smoother and better end-to-end performance.
Bio: I am the founder and CEO of Singularity Data (https://www.singularity-data.com/), a startup innovating next-generation database systems. Before starting my adventure, I was a software engineer on the Redshift team at Amazon Web Services and a researcher in the Database Group at IBM Almaden Research Center. I received my PhD degree from the National University of Singapore, where I was affiliated with the Database Group (advisor: Kian-Lee Tan). I was also a visiting PhD student in the Database Group at Carnegie Mellon University (host advisor: Andrew Pavlo). I earned my bachelor's degree from South China University of Technology.
I am passionate about integrating research into real-world system products. During my time at AWS, I was responsible for boosting Amazon Redshift performance using advanced vectorization and compression techniques. Before that, I participated in the development of the indexing structure and transaction processing mechanism of IBM Db2 Event Store. During my PhD, I developed two main-memory DBMS prototypes, Peloton and Cavalia. I was also an early contributor to Stratosphere, which is now widely known as Apache Flink.