
Ververica donates Flink CDC - Empowering Real-Time Data Integration

Written by Ververica | 03 April 2024

Ververica has officially donated Flink Change Data Capture (CDC) to the Apache Software Foundation. In this blog, we’ll explore the significance of this milestone, and how it positions Flink CDC as a one-stop, real-time data integration solution based on Apache Flink.

Overview

CDC Connectors for Apache Flink® (also known as Flink CDC) is an open-source project on GitHub, developed by Ververica.

Flink CDC was launched in July 2020 by Jark Wu, Head of Flink SQL. Since then, the community has seen remarkable growth and innovation under the stewardship of Leonard Xu. The project introduced features that simplified users' CDC data processing pipelines and provided an elegant data integration solution.

These advancements propelled the rapid growth of the project with:

  • Over 100 source code contributors
  • More than 5,000 GitHub stars
  • A vibrant community of over 10,000 members

What is Flink CDC?

Flink CDC is a streaming data integration tool. It works with most mainstream databases and reads their changelogs to integrate Change Data Capture (CDC) data in real time.

Flink CDC captures database changes (inserts, updates, deletes) as events and streams them to downstream systems for processing. It offers efficient real-time data handling, cost-effective scalability, adaptability to dynamic data environments, and streamlined integration using features like schema evolution and full database synchronization.
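To make the idea of change events concrete, here is a minimal, library-free sketch in Python. It is not Flink CDC code: the `ChangeEvent` shape and the `apply_changes` helper are hypothetical names chosen for illustration. It only shows how a stream of inserts, updates, and deletes can keep a downstream copy of a table in sync.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

Row = Dict[str, object]


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level change read from a database changelog (hypothetical shape)."""
    op: str                 # "insert", "update", or "delete"
    table: str              # name of the table the change belongs to
    key: int                # primary key of the affected row
    before: Optional[Row]   # row image before the change (None for inserts)
    after: Optional[Row]    # row image after the change (None for deletes)


def apply_changes(replica: Dict[str, Dict[int, Row]], events: Iterable[ChangeEvent]) -> None:
    """Apply change events in order so the replica mirrors the source tables."""
    for event in events:
        rows = replica.setdefault(event.table, {})
        if event.op in ("insert", "update"):
            # Inserts and updates both leave the "after" image as the current row.
            rows[event.key] = dict(event.after or {})
        elif event.op == "delete":
            rows.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")


events = [
    ChangeEvent("insert", "orders", 1, None, {"id": 1, "status": "new"}),
    ChangeEvent("insert", "orders", 2, None, {"id": 2, "status": "new"}),
    ChangeEvent("update", "orders", 1, {"id": 1, "status": "new"}, {"id": 1, "status": "paid"}),
    ChangeEvent("delete", "orders", 2, {"id": 2, "status": "new"}, None),
]

replica: Dict[str, Dict[int, Row]] = {}
apply_changes(replica, events)
print(replica)  # {'orders': {1: {'id': 1, 'status': 'paid'}}}
```

Because the events are applied in changelog order, the replica ends up with only the rows that still exist in the source, in their latest state.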

Today, Flink CDC is deployed in production environments by companies around the globe.

How Apache Flink and Flink CDC work together

Apache Flink is well known for its powerful pipelines and its extensive support for connecting to other systems. Its ability to handle continuous streams of data makes it well suited for CDC tasks, where changes in the source database need to be processed in real time.

Here's how it works: Flink CDC parses the user's YAML file, which defines the data source, the sink, and the transforms between them, then builds and submits a Flink job to start the synchronization pipeline. As changes occur, they are read by Apache Flink, optionally transformed and routed by operators, and then streamed to downstream processing pipelines or systems for further analysis, reporting, or other purposes.
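The flow described above can be sketched in plain Python. This is a toy model under stated assumptions, not Flink CDC's implementation or its configuration schema: the `PIPELINE` dict stands in for the user's YAML file, its keys are illustrative, and `run_pipeline` is a hypothetical helper that plays the role of the submitted job.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Event = Dict[str, object]

# Hypothetical pipeline definition, written as a Python dict to stay dependency-free.
# The keys echo the source / transform / route / sink idea from the text; they are
# illustrative only and are not Flink CDC's actual configuration keys.
PIPELINE = {
    "source": {"tables": ["app.orders_01", "app.orders_02"]},
    "transform": [lambda row: {**row, "status": str(row["status"]).upper()}],
    "route": {"app.orders_01": "dw.orders", "app.orders_02": "dw.orders"},
    "sink": {"name": "in-memory"},
}


def run_pipeline(definition: dict, events: Iterable[Event]) -> Dict[str, List[Row]]:
    """Read events, transform each row, route it to a target table, and collect it."""
    captured_tables = set(definition["source"]["tables"])
    transforms: List[Callable[[Row], Row]] = definition["transform"]
    routes: Dict[str, str] = definition["route"]
    sink: Dict[str, List[Row]] = {}

    for event in events:
        table = event["table"]
        if table not in captured_tables:
            continue  # the source only captures the tables it was configured for
        row = event["row"]
        for transform in transforms:
            row = transform(row)
        target = routes.get(table, table)  # unrouted tables keep their own name
        sink.setdefault(target, []).append(row)
    return sink


events = [
    {"table": "app.orders_01", "row": {"id": 1, "status": "new"}},
    {"table": "app.orders_02", "row": {"id": 2, "status": "paid"}},
    {"table": "app.users", "row": {"id": 9, "status": "n/a"}},
]

result = run_pipeline(PIPELINE, events)
print(result)
# {'dw.orders': [{'id': 1, 'status': 'NEW'}, {'id': 2, 'status': 'PAID'}]}
```

In this sketch, two sharded source tables are routed into a single target table, and the event from a table outside the source definition is ignored. The real project runs these steps as operators in a distributed Flink job rather than in a single loop.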

By using Apache Flink, Flink CDC can efficiently integrate large amounts of data in real time. This capability is particularly useful in scenarios such as real-time analytics, data synchronization between different systems, maintaining replica databases, or triggering real-time actions based on database changes.

Ververica donates Flink CDC

As a maintainer in the open-source community, Ververica strives to broaden the impact of Flink CDC within the field of data integration. We collaborate with users and developers to build and strengthen the community.

As the Flink CDC project evolved over recent years, two copyright concerns surfaced that required attention.

  • Although the Flink CDC source code was open under the Apache License, Version 2.0, the copyright of the project belonged to Ververica, not to a neutral foundation.
  • The project name, CDC Connectors for Apache Flink, was abbreviated to “Flink CDC”; however, the project was not directly affiliated with Apache Flink, leading to potential copyright concerns.

To address these concerns and remove any hesitation to participate in the Flink CDC open-source community, we opted to donate the project. During Flink Forward Asia in December 2023, Jark Wu (the visionary behind Flink CDC) announced Ververica's intent to donate Flink CDC to the Apache Software Foundation as a sub-project of Apache Flink.

Thanks to the collaborative endeavors of the Flink CDC community developers, particularly community maintainers, the donation process has officially concluded.

Changes to access and process

Now that the donation process is complete, the original code repository and documentation website will no longer be used. The project now resides in a new GitHub repository and documentation site, aligning with Apache Flink's community standards and development practices.

As a sub-project of Flink, the subsequent development of Flink CDC will strictly follow the Apache Flink community processes, including:

  • Work items and defects, previously managed through GitHub issues, will be migrated to Flink Jira.
  • Community development discussions and exchanges will transfer gradually to the Flink community mailing list.

See the Apache Flink Community page and Developer Guide for more information and to join discussions in the Flink mailing list.

Repositioning Flink CDC within the data integration framework

As an independent sub-project of Apache Flink, Flink CDC is no longer limited to providing Flink source connectors. Instead, it can deliver complete end-to-end integration capabilities.

The new framework provides a brand-new API, designed specifically for data integration scenarios. Users can launch a synchronization pipeline by specifying information from external systems on the source and sink side, without understanding the internals of Apache Flink.

As a sub-project of Apache Flink, Flink CDC can cooperate and integrate more deeply with the Flink runtime, and leverage the dominant position of Apache Flink in the stream processing area to provide a high-performance and real-time data integration experience.

Looking forward, the Flink CDC community will continue to improve the ecosystem by:

  • Supporting transform operations
  • Expanding the connector ecosystem to support more systems like Apache Kafka, Apache Pulsar, and Apache Paimon

Advanced CDC with Ververica Cloud

As the leader in streaming data technologies, founded by the original creators of open-source Apache Flink®, Ververica has developed advanced CDC capabilities with its flagship product, Ververica Cloud.

The Ververica Runtime Assembly (VERA) is an optimized Flink stream processing engine that seamlessly integrates with Flink CDC, and is an integral part of Ververica Cloud.

VERA supports advanced CDC capabilities such as:

  • data and table merging
  • schema evolution
  • full database synchronization
  • synchronization at the level of thousands of tables

It also optimizes performance (up to 2x faster than Apache Flink), provides dynamic complex event processing (CEP), and extends the ecosystem of connectors.

While both Ververica Cloud CDC and Flink open-source CDC use change data capture technologies, the key differences between them are performance, architecture, ease of use, and seamless operations.

Flink open-source CDC requires manual setup and configuration and places the responsibility of managing infrastructure and scaling on users. This solution works best for users who have deep knowledge and expertise with Flink’s ecosystem.

On the other hand, Ververica Cloud CDC offers additional features and enhancements beyond the core Flink CDC. It abstracts complexities related to deployment, configuration, infrastructure management, scalability, fault tolerance, and maintenance, making it more accessible for users who prioritize ease of use and streamlined workflows.

For more information about Ververica CDC and its features, use cases, and practical implementation, refer to the Ververica documentation and blog posts or contact us.

The future of Flink CDC

Flink CDC is entering a new chapter with enhanced focus and capabilities, benefiting from Apache's governance, community, and ecosystem support.

As Alexander Walden, CEO of Ververica, put it, “The donation of Flink CDC to the Apache Software Foundation marks a milestone in our journey towards building a more open, collaborative, and innovative data integration landscape. We are excited to see Flink CDC flourish within the Apache ecosystem, benefiting from and contributing to the collective wisdom and effort of the global developer community. This step reinforces our commitment to open source and the belief that together, we can achieve greater advancements and wider adoption of streaming data technologies.”

About Ververica

Ververica enables its customers to unlock the value of their data. Ververica’s comprehensive streaming data platform supports a wide range of options from a fully-managed, cloud-native service (Ververica Cloud) to on-premise software (Ververica Platform).

Founded by the original creators of open-source Apache Flink®, Ververica has the experience and knowledge to continue leading the innovation of streaming data technologies. This leadership is demonstrated through contributions to open-source software projects, an extensive streaming data learning environment (Ververica Academy), and leading the Apache Flink and streaming data conference, Flink Forward. Discover more at www.ververica.com.