Every year, Apache Flink® sets new records in its development journey. Standing as a testament to its growing popularity, Flink now boosts over 1.6k contributors, 21k GitHub stars, and 1.4M downloads. In operational environments, Flink clusters are reaching impressive scales, with some individual clusters surpassing 2000 nodes. The largest known Flink infrastructure in production boasts over 4 million CPU cores, processing a staggering 4.1B events per second. If scalability is a concern, Flink has proven itself in the industry and stands as a reliable choice.
Over the past few years, Flink has witnessed widespread adoption across diverse industries, ranging from e-commerce and finance to gaming and healthcare. Mature solutions powered by Flink, such as personalized recommendations and robust risk control systems, now play a pivotal role in shaping the daily experiences of individuals.
In the realm of batch data processing, data warehouses and data lakes play pivotal roles. Data warehouses process structured data through ETL pipelines, with the results serving as fodder for other applications. As the data landscape evolves, especially with the surge in AI requirements, data lakes have emerged to store raw data of varied structures, including structured, semi-structured, and unstructured.
Enter the Lakehouse, an evolution of the data lake that aims to unify the strengths of data warehouses and data lakes. Lakehouses offers the scalability and flexibility inherent to data lakes, coupled with metadata and table formats.
In the realm of data processing, two crucial factors always come into play: cost and result latency. Visualizing this relationship, we can refer to it as the data processing triangle. If cost is a primary concern, the choice leans towards a Lakehouse; if latency takes precedence, real-time streaming becomes the preferred option. Real-time streams are indispensable for specific business cases such as risk control, while Lakehouses, through batch data processing, cater to a myriad of other scenarios and use cases.
As businesses grow and their dynamics become clearer, the need for low-latency data arises. The immediate consideration is often migrating jobs from a Lakehouse to a real-time stream. But does this transition truly make sense?
In large organizations running hundreds of thousands of batch jobs, the scarcity of streaming jobs is questioned. The common response attributes this scarcity to the absence of real-time business requirements. However, this seems paradoxical, as faster decision making inherently accelerates business operations—time is money, after all. An alternative perspective posits that perhaps the lack of readiness in streaming infrastructure is impeding the recognition of real-time business needs. The counterargument to this is that streaming jobs are typically expensive, leading to a dilemma: is it justified to invest significantly in building infrastructure when the business requirement is uncertain?
The overarching question emerges: is there a more efficient, cost-effective, and consequently, low-risk method to facilitate system upgrades?
Streamhouse disrupts the deadlock by providing a solution to upgrade high-latency batch jobs to Streamhouse with near-real-time latency. This allows businesses to assess its suitability by running operations and gauging results. After obtaining business outcomes, an evaluation of the ROI becomes feasible. The decision then rests on either adhering to Streamhouse with optimized costs or progressing further to real-time streaming computation. The seamless migration facilitated by the Flink unified engine and Flink SQL minimizes the effort involved.
This approach aligns with the data processing triangle concept, leading to the evolution of a data processing architecture pattern termed LSR (Lakehouse, Streamhouse, and Real-time stream). The cost increment occurs progressively from Lakehouse to Streamhouse and then to real-time streaming.
In any data processing pipeline, four key components play crucial roles: data ingestion, data computation, metadata, and data storage. Lakehouse and streaming data processing differ in their approaches. Lakehouse relies on batch ETL, batch data processing, and batch table format, saving data in object storage. On the other hand, streaming involves ingesting data in streaming mode, supporting Change Data Capture (CDC). Streaming processing engines like Flink process the data, with additional services like schema registry managing the schema, and data stored in the message system.
For instance:
Let's explore more details about the engine and the improvements made in Flink SQL.
The checkpoint duration plays a crucial role in end-to-end latency. A shorter checkpoint duration translates to lower latencies for transactional sinks, more predictable checkpoint intervals for daily operations, and less data to recover. Improving checkpoint duration involves addressing two major factors: the travel time of barriers through the pipeline and the time taken for checkpoints.
The GIC (Generic Log-based Incremental Checkpoint) introduces a state changelog, further reducing checkpoint duration and supporting an even shorter checkpoint interval. Statistics indicate that GIC could provide five times faster checkpointing, with the incremental file size significantly reduced by 95%.
Hybrid Shuffle emerges as a pivotal feature in the unification of batch and stream processing. While Pipeline Shuffle offers superior performance, it demands more resources as all interconnected tasks must start simultaneously. The allure of using Pipeline Shuffle in batch execution is tempered by challenges, especially in production environments where resource availability isn't guaranteed. The competition among multiple tasks for resources can even lead to deadlocks.
In Flink 2.0, the community is set to elevate the Flink state storage management system by transitioning to a fully decoupled storage and computation architecture, aligning with the demands of a cloud-native environment.
This transformative move will establish a tiered storage system referred to as the 'Tiered State Backend' architecture.
Key Enhancements Envisioned:
In Flink 2.0, a significant objective is the execution of API Evolution, which involves restructuring the API into four distinct categories:
The evolution unfolds in two phases:
By adopting SQL as the abstraction layer, Flink creates a symbiotic relationship. Users can focus solely on their business requirements, while Flink relentlessly concentrates on data processing and continuous optimization.
Regarding SQL capabilities, Flink already offers an extensive set of syntax, including DDL, DML, aggregation, joins, and more. Yet the full potential of Flink SQL is often underestimated. It can be categorized into four main types:
With these powerful SQL capabilities, users can accomplish a wide array of tasks.
Having delved into Flink CDC and Flink, let's explore the storage platform integral to the Streamhouse - Apache Paimon. Born as a Flink sub-project under the moniker "Table Store," Paimon has evolved into an Apache incubator project.
This architecture positions Paimon as a unified batch and stream data lake platform, boasting various exciting features:
Examining common data processing pipelines involving ODS ingestion, DWD transformation, and DWS, several intriguing scenarios emerge:
Scenario 2: Deploy Flink SQL for streaming or batch ETL from ODS to DWD. Allow OLAP engines to consume DWD, resulting in a significant reduction in end-to-end data latency.
Scenario 3: Utilize Flink SQL to construct end-to-end streaming mixed ETL processing (streaming and batch). Employ a unified technology for diverse data processing tasks, effectively reducing both development and operational costs.
The Streamhouse concept was introduced as a result of recognizing the gap between Lakehouse and real-time streaming data processing. Streamhouse was implemented by reshaping Flink and introducing tools like Flink CDC and Paimon. The result is a comprehensive evolution of the LSR data processing architecture—one driven by business value.
The architecture's flexibility shines when cost considerations take center stage, a crucial aspect in the current global economy. While there are use cases justifying direct investment in real-time streaming, the majority of batch data processing scenarios find effective solutions in the Lakehouse. Illustrating this point with examples like Iceberg and Spark, it's apparent that you might have your own tech stack for the Lakehouse.
The work accomplished at Ververcia demonstrates that, with an acceptable minute data latency, Streamhouse can significantly cut costs by 80% compared to real-time streaming. Employing streaming preprocessing within the Streamhouse yields five times better write performance and eight times better query performance when contrasted with traditional batch Lakehouses.
Based on these insights, Ververica built the next-generation data platform, Ververica Cloud, with a vision to empower customers to run their jobs anywhere, enhancing both development and operational efficiency. Ververica Cloud's architecture is open and flexible, supporting fully managed, semi-managed, and on-premises deployment modes. This architecture abstracts the diversity of cloud providers, allowing users to execute jobs directly in any cloud where their data is stored.
Ververica Runtime Assembly (VERA), the enterprise engine, is 100% Flink compatible and offers superior performance. Certain features of VERA will be contributed back to the open-source community in the future. At the platform level, Ververica Cloud introduces services like Autopilot, which leverages user data to train AI models for automatic resource scheduling, and Advisor, a repository of troubleshooting knowledge providing hints and suggestions for Flink job issues.
As a cost-effective product, Ververica Cloud offers flexible pricing options, including Pay-As-You-Go (PAYG) and Reserved Capacity (RC), tailored to diverse business requirements. Larger companies even have the option to bring their own cloud (BYOC) and integrate it with the Ververica ecosystem.
Additionally, at the conceptual level, Ververica Cloud is the first and only platform that supports both real-time streaming and Streamhouse. As customers, you will have the flexibility to start with Streamhouse and then decide to upgrade to real-time streaming or simply kick off with real-time streaming directly.
Lastly, at Ververica, we will continue our efforts on Streamhouse, both within Ververica Cloud and open-source Apache Flink, to support a broader range of business scenarios, consistently optimizing costs and improving value.
Ready to reach speeds of up to 2x faster than open source Apache Flink? Sign up for Ververica Cloud for free and receive $400 cloud credit.