
Scaling a real-time streaming warehouse with Apache Flink, Parquet and Kubernetes

Written by Ramesh Shanmugam & Aditi Verma | 30 March 2019

Flink Forward San Francisco is just a couple of days away! In case you haven’t booked your tickets yet, here’s a sneak preview of our session, "Scaling a real-time streaming warehouse with Apache Flink, Parquet and Kubernetes," on April 2, 2019, to give you more insight into what you can expect at the conference next week.

If you haven’t registered already, make sure to book your last-minute tickets while they last! Spots are limited, so hurry to secure your place at Flink Forward and learn more about the exciting world of Apache Flink!

Scaling a real-time streaming warehouse with Apache Flink, Parquet and Kubernetes

Background

Branch is the industry-leading mobile measurement and deep linking platform. To power it, we process more than 20 billion events and store several terabytes of data per day.

In this talk, we cover our learnings and challenges from running and scaling an Apache Flink Parquet warehouse on Kubernetes, in particular around memory management and failure recovery. We also describe our current Apache Flink infrastructure and our recovery and auto-scaling mechanisms in detail.

Topics covered

This talk gives a detailed overview of our challenges around writing columnar file formats with Flink. We also cover the decisions and learnings from migrating Flink jobs from Mesos to Kubernetes. Finally, we talk about auto-scaling Flink jobs on Kubernetes, as well as efficiently handling failure scenarios.
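
To give a flavor of the Parquet piece of the talk: since Flink 1.7, the StreamingFileSink can write Parquet as a bulk format, rolling and committing part files only on checkpoints. Below is a minimal, self-contained sketch of that pattern; the event class, output path, and intervals are hypothetical placeholders for illustration, not our production code.

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    public class ParquetWarehouseSketch {

        // Hypothetical event type; any class that Avro reflection can map works.
        public static class TrackingEvent {
            public String eventName;
            public long timestampMillis;
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Bulk formats like Parquet roll part files on every checkpoint, so
            // checkpointing must be enabled or the files are never finalized.
            env.enableCheckpointing(60_000L);

            // A simple fixed-delay restart strategy for failure recovery.
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                    3, Time.of(10, TimeUnit.SECONDS)));

            // In a real pipeline this stream would come from Kafka or similar.
            DataStream<TrackingEvent> events = env.fromElements(new TrackingEvent());

            StreamingFileSink<TrackingEvent> sink = StreamingFileSink
                    .forBulkFormat(
                            new Path("s3://my-bucket/warehouse"),  // hypothetical path
                            ParquetAvroWriters.forReflectRecord(TrackingEvent.class))
                    .build();

            events.addSink(sink);
            env.execute("parquet-warehouse-sketch");
        }
    }

The coupling between checkpoints and the Parquet file lifecycle is exactly the kind of subtlety that makes this setup interesting to operate at scale.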

Key takeaways

  • Learnings from running Apache Flink clusters on Mesos and Kubernetes

  • Takeaways from writing Parquet files with Apache Flink

Make sure to secure your spot by registering on the Flink Forward website today. The event includes multiple tracks, and it’s a unique opportunity to take your knowledge and stream processing expertise to the next level! Sessions cover, among other topics, Flink use cases, technology deep dives, and the broader Apache Flink and stream processing ecosystem, so don’t miss out on the exciting conference schedule!



About the authors:

Aditi Verma

Aditi is a senior software engineer at Branch, working on developing and scaling its data platform, which processes tens of billions of events per day. Prior to Branch, she worked at Yahoo, developing data systems that provide actionable insights and audience targeting from petabytes of data. She has a wide range of experience in the data domain, from stream and batch processing to resource management, scaling and monitoring.

Ramesh Shanmugam

Ramesh Shanmugam is a Senior Data Platform Engineer at Branch Metrics, where he currently builds streaming and batch pipelines at huge scale using Apache Flink, Spark, and Airflow. He has been creating distributed applications for more than 15 years and is passionate about building data-intensive applications.