Flink Forward Session Preview: Streaming Event-Time Partitioning With Apache Flink and Apache Iceberg at Netflix
Are you thinking of joining Flink Forward Europe 2019? As a first-time speaker at the event, I am excited to meet the Apache Flink community and share how we leveraged Apache Flink and Apache Iceberg (Incubating) at Netflix to build streaming event-time partitioning at scale. In this post, I will give a sneak preview of my talk Streaming Event-Time Partitioning With Apache Flink and Apache Iceberg that you can attend October 8, 2019!
If you haven’t done so already, go ahead and register to learn more from exciting use cases and technical deep dives on October 8 & 9, or attend one of the Apache Flink training sessions scheduled on October 7!
Here’s what my talk will be about at this year’s Flink Forward Europe:
Streaming Event-Time Partitioning With Apache Flinkand Apache Iceberg
Background
At Netflix, we’ve seen a lot of success and also valuable learnings building some of our core data pipelines with near real-time stream processing in Flink. One challenge that we hadn’t yet tackled though was landing data directly from a stream into a table partitioned on event time. Instead, we were often relying on post-processing batch jobs that “put the data in the right place”.
Our playback data, in particular, depends heavily on a repartitioning pattern to handle a long tail of late-arriving events. This partitioning ensures that all playbacks that occurred on any given date can easily be queried together, regardless of what time we actually processed those events. This is a critical dataset with high volume used broadly across Netflix, powering product experiences, A/B test metrics, and offline insights.
With the introduction of Apache Iceberg as a new table format being developed at Netflix and a supporting internal Flink integration, we decided it was time to give streaming event-time partitioning a shot. As an important dataset with a growing number of low latency requirements, playback data was the perfect candidate to put it to the test.
Topics covered
In this talk, I will focus on what you do with your data after your streaming pipelines by discussing a pattern for performing event-time partitioning in Flink on high volume streams.
I’ll give an overview of how we process playback data at Netflix and why event-time partitioning is so important. I’ll briefly introduce Iceberg and how it helps solve this problem, and then describe our implementation of generic event-time partitioning on streams through configurable Flink components that leverage Iceberg as an output table format. And of course, discuss how we applied this generic implementation to improve playback data! I’ll share some of the challenges we hit along the way and talk through tradeoffs with this approach.
Key takeaways
Interested in what you will learn from my session? Here are some key takeaways the Apache Flink community can expect:
-
A generic pattern for event-time partitioning at scale with Flink and Iceberg
-
Challenges and tradeoffs with a streaming approach to event-time partitioning
-
Overview of how we process Netflix’s core playback data and learnings from moving it to this pattern
-
A brief introduction to Apache Iceberg as a new table format and how it integrates with Flink
Make sure to secure your spot before September 15 by registering on the Flink Forward website. Sessions this year cover multiple aspects of stream processing with Apache Flink such as use cases, technology deep dives, ecosystem, operations, community and research talks among others! Grab your pass and come to meet the Apache Flink community in Berlin this October!
You may also like
Your Cloud, Your Rules: Ververica's Bring Your Own Cloud Deployment
Explore Ververica's new BYOC deployment option that balances flexibility,...
From Kappa Architecture to Streamhouse: Making the Lakehouse Real-Time
From Kappa to Lakehouse and now Streamhouse, explore how each help addres...
Fluss Is Now Open Source
Fluss, a real-time streaming storage system for data analytics, is now op...
Announcing Ververica Platform: Self-Managed 2.14
Discover the latest release of Ververica Platform Self-Managed v.2.14, in...