Elastic Data Processing with Apache Flink and Apache Pulsar

Written by Sijie Guo | 28 February 2019

Excited for Flink Forward San Francisco 2019? As one of the 30+ conference speakers, I want to give a sneak preview of my upcoming Flink Forward talk: Elastic Data Processing with Apache Flink and Apache Pulsar.

Flink Forward returns to San Francisco for the third year in a row to showcase the latest developments around Apache Flink and the stream processing ecosystem. This year, it introduces exciting use cases to the Flink and stream processing communities. If you haven’t done so yet, go ahead and register to find out about the latest stream processing developments!

Here’s what you can expect from my talk during Flink Forward this year:

Elastic Data Processing with Apache Flink and Apache Pulsar

Background

As the need for fast data continues to grow, stream computing is being adopted more and more widely as a framework for low-latency data processing. Computing frameworks like Apache Flink unify batch and stream processing in a single computing engine, with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data, messaging, and storage layers.

In reality, we still live in a world where data is segregated into silos created by various storage and messaging technologies. The two main types of data are still stored in very different ways: software engineers use message queues and log storage systems to store near-real-time event data, and filesystems and object stores to store static data for batch processing. This means that even with a unified computing engine, data scientists still need to write programs that process different sets of data from different silos, and your SRE teams still have to operate on two different sets of data. As a result, there is no single source of truth, and overall operations remain messy for development teams.

Flink addresses the problem by standardizing the computation in a “stream” way, where everything is treated as “streams”. Batch processing is just a special case of stream processing, processing a bounded stream.

Similarly, we addressed the operational mess by storing data in streams. Only one copy of the data is stored (the source of truth), and it can be accessed as streams (via pub-sub interfaces) and as segments (for batch processing). I call this approach “segmented-streams”. It has been applied in many Apache BookKeeper-based data systems, such as Twitter’s EventBus, EMC’s Pravega and Apache Pulsar.
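
To make the idea concrete, here is a minimal Python sketch of a segmented stream. It is a toy in-memory model, not Pulsar or BookKeeper code: the names `Segment`, `SegmentedStream`, `append`, `read_stream` and `sealed_segments` are hypothetical and chosen only for illustration. It assumes a fixed number of records per segment, and it shows a single stored copy of the data being read either as one ordered stream or as independent sealed segments.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Segment:
    """A contiguous run of records; a sealed segment no longer accepts writes."""
    records: List[str] = field(default_factory=list)
    sealed: bool = False


class SegmentedStream:
    """Toy model (hypothetical): one copy of the data, kept as ordered segments."""

    def __init__(self, segment_size: int = 3) -> None:
        self.segment_size = segment_size
        self.segments: List[Segment] = [Segment()]

    def append(self, record: str) -> None:
        # Writes always go to the last (open) segment.
        current = self.segments[-1]
        current.records.append(record)
        if len(current.records) >= self.segment_size:
            # Seal the full segment and open a new one for later writes.
            current.sealed = True
            self.segments.append(Segment())

    def read_stream(self) -> Iterator[str]:
        # Stream view: every record in order, across segment boundaries.
        for segment in self.segments:
            yield from segment.records

    def sealed_segments(self) -> List[List[str]]:
        # Segment view: each sealed segment can be handed to a separate
        # batch reader, since its contents will not change.
        return [list(s.records) for s in self.segments if s.sealed]


stream = SegmentedStream(segment_size=3)
for i in range(7):
    stream.append(f"event-{i}")

print(list(stream.read_stream()))   # stream view of the data
print(stream.sealed_segments())     # segment view of the same data
```

Both views read the same underlying records; nothing is copied into a second storage system for batch use. In a real system the segments would live in distributed storage rather than in a Python list.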

While Flink unifies computation in a “stream” way, Apache Pulsar (together with Apache BookKeeper) unifies data in a “stream” way. Combined, the two can form a unified data architecture that serves many data-driven businesses.

Topics covered

In this presentation, I will talk about the “segmented-streams” concept and architecture. Using Apache Pulsar as an example, I will explain how various Apache BookKeeper-based systems are built on this concept, and how Apache Flink can integrate with such systems for elastic batch and stream processing over segmented streams.

Finally, I will explain how we integrate Apache Pulsar and Apache Flink through streaming and batch connectors, and how Flink can leverage the built-in schema management in Apache Pulsar.

Key takeaways

I hope you find the topic of my talk as exciting and interesting as I do! I hope attendees will enjoy the talk and come away with answers to the following questions:

What are “segmented-streams”? And what is Apache Pulsar?
Why does a “segmented-streams” system (Apache Pulsar) fit better with Apache Flink for elastic batch and stream processing?
How do we integrate Pulsar and Flink? What are the challenges?
What is the future roadmap for the integration of Pulsar and Flink?

If you are interested in more sessions about how Flink integrates with the data processing ecosystem and technologies such as deployment and resource management frameworks (e.g. DC/OS, Kubernetes, YARN), message queues (e.g. Apache Kafka, Amazon Kinesis, Apache Pulsar), databases (e.g. Apache Cassandra, Redis), durable storage or logging and metrics, some of the talks below might interest you as well:

Integrate Flink with Hive Ecosystem by Xuefu Zhang, Alibaba
Deploying ONNX models on Flink by Isaac Mckillen-Godfried, AI Stream

Don’t forget to register before March 23 to secure your spot and immerse yourself in the exciting world of stream processing and Apache Flink! See you in San Francisco in a few weeks!
