We are excited to introduce Fluss, a groundbreaking project designed to tackle longstanding challenges in streaming data storage and analytics. Named after the German word for 'river,' Fluss is engineered to offer a high-performance, scalable, and fully integrated solution for real-time data processing with Apache Flink®, driving our vision towards a complete unified batch and streaming data platform.
Apache Flink has become the de facto standard for stream processing and Flink SQL has become an indispensable tool for users to build real-time applications. Typical application scenarios include data cleaning, feature engineering and extraction, transformations, multi-stream merging and joins, aggregations, and more. Flink is often deployed along with a streaming storage layer like Apache Kafka and combined with real-time OLAP systems, such as Clickhouse and StarRocks. This enables the creation of streaming-first architectures that treat all incoming data as unbounded streams, with the ability to process and analyze data as it arrives.
The following figure depicts a streaming-first architecture:
IMAGE 1: STREAMING-FIRST ARCHITECTURE
Apache Kafka plays a central role in such architectures by acting as the backbone for storing and transporting real-time data. It can capture events from multiple sources—applications, sensors, and databases—and stream them in real time to consumers, like analytics platforms, microservices, or machine learning models. Kafka's high throughput, fault tolerance, and scalability make it a popular choice for building large-scale, real-time data pipelines.
On top of Kafka, Apache Flink provides a powerful framework for real-time data processing. Flink’s stream-first architecture allows it to handle continuous data flows, making it an excellent choice for event-driven applications that require sub-second latency. Flink’s support for stateful computations, windowing, and exactly-once processing semantics allows for complex operations such as event-time processing and real-time aggregations.
These capabilities have fueled the adoption of streaming-first architectures, allowing companies to develop solutions for real-time fraud detection, personalized recommendations, monitoring systems, and more.
Although Apache Kafka has excelled as a row-oriented, log-based storage layer, it falls short in certain key areas, especially regarding real-time data analytics. Log-based storage falls short in the following areas:
Apache Flink’s strength lies in real-time data processing, and it's often paired with Online Analytical Processing (OLAP) systems for deep-dive analysis of aggregated data. However, OLAP introduces its challenges in the architecture, particularly when considering storage costs. OLAP systems are designed to enable fast, multidimensional queries over large datasets, which are particularly useful for reporting, dashboards, and business intelligence use cases. However, this capability comes at a cost—literally.
Apache Flink uses RocksDB to store state, but it lacks a streaming storage layer to provide a database-level experience. Flink needs to stitch multiple upstream and downstream storage components, which increases the complexity of the system and affects the experience (some systems support upserts and some don’t, and primary key constraints are easily broken). There are a lot of variations between different storage layers, and behaviors vary greatly, leading to problems of not knowing what to choose. What is missing is a streaming storage layer that resembles a data warehouse while being native to Flink.
IMAGE 2: COLUMNAR STREAMING STORAGE IS MISSING
It is difficult to transform existing streaming solutions to fit these needs, so we developed a streaming storage designed for real-time analytics. This started the development of Fluss, a purpose-built storage layer optimized for modern stream processing needs.
IMAGE 3: ARCHITECTURE OVERVIEW WITH FLUSS
Fluss is designed to provide a unified streaming storage layer that addresses these specific limitations head-on. The key features and benefits of Fluss in the world of streaming data analytics include:
It also provides lakehouse tiered storage such as Apache Paimon and Apache Iceberg, allowing bi-directional communication with lakehouses. This allows a streaming job to load state from batch sources, enabling seamless state initialization and synchronization between batch and streaming data.
Fluss has been running in production for the last few months. It will be open-sourced and donated to the Apache Software Foundation (ASF) at the end of the year.
In the age of AI and machine learning, where data must be both accessible and immediately actionable, Ververica provides a solution for businesses looking to harness real-time and historical data together. Whether you're building advanced analytics pipelines, monitoring real-time events, or implementing AI-driven insights, Ververica’s Streaming Data Platform provides the flexibility, speed, and efficiency you need to succeed.
We hope that the idea of streaming storage for Apache Flink that can be shared across multiple systems holds the potential to significantly reshape the landscape of real-time data processing and analytics. By integrating a native, purpose-built storage system within Flink, organizations can achieve a unified architecture that seamlessly supports both real-time and long-term data processing. This would eliminate the current need for intermediate Kafka topics or separate stream processing tools, simplifying data pipelines and reducing operational complexity.