3 important performance factors for stateful functions and operators
This post focuses on the 3 factors developers should keep in mind when assessing the performance of a function or operator that uses Flink’s Keyed State in a stateful streaming application.
Keyed state is one of the two basic types of state in Apache Flink, the other being Operator state. As its name suggests, keyed state is bound to keys and is only available to functions and operators that process data from a KeyedStream. The difference between operator and keyed state is that operator state is scoped per parallel instance of an operator (sub-task), while keyed state is partitioned or sharded based on exactly one state-partition per key.
For more information about State in the Apache Flink, the documentation section “Working with State” describes how to use Flink’s state abstractions when developing an application.
3 factors that impact the performance of Keyed State in Apache Flink
With this in mind, let’s discuss below 3 factors that will impact the performance of Flink’s keyed state that you should keep in mind while developing stateful streaming applications:
The chosen state backend
The selected state backend has the most dominant impact on the performance of your stateful function or operation in a Flink application. The most distinctive factor here is how each state backend handles serialization of your state for persistence differently.
For example, when using the FsStateBackend or MemoryStateBackend, local state is maintained as on-heap objects during runtime and therefore has low overhead when accessing or updating them. Serialization overheads only occur when snapshots of the state are taken to create Flink checkpoints or savepoints. Downside to using this state backend is that state size is limited by heap size of the JVM, and can potentially run into OutOfMemory errors or long pauses for garbage collection.
On the contrary, out-of-core state backends such as the RocksDBStateBackend allows much larger state sizes by maintaining local state on disk. The tradeoff is that every state read and write would require serialization / deserialization.
To wrap this up, if your state size is small and expected to not exceed heap size, then using on-heap backends would be the obvious choice as it avoids serialization overheads. Otherwise, currently, RocksDBStateBackend would be the go-to choice for applications with large state sizes.
Per-key state primitives
ValueState / ListState / MapState
Another important factor is knowing to choose the correct state primitives. Flink currently supports 3 main state primitives for keyed state: ValueState, ListState, and MapState.
One common mistake new developers to Flink might make is having as state, for example, a ValueState<Map<String, Integer>> while the map entries are intended to only be randomly accessed. In this case, it is definitely better to use MapState<String, Integer>, especially when taking into account that out-of-core state backends, such as RocksDBStateBackend, serializes/deserializes ValueState states completely on access, while for MapState, serialization occurs per-entry.
Access Pattern
Following up the previous section about state primitives, it was already quite thoroughly hinted that assessing how your application logic accesses state will help determine what state structure you should be using. As developers should always expect when designing any kind of application, using unsuitable data structures for your application’s specific data access pattern can have a severe impact on overall performance.
Conclusion
Developers should consider all three factors above as they can affect the performance of stateful functions and operators in Flink to a great extent. We encourage you to check our Advanced Training Schedule below for more best practices and application design patterns for stateful streaming applications.
From Kappa Architecture to Streamhouse: Making the Lakehouse Real-Time
From Kappa to Lakehouse and now Streamhouse, explore how each help addres...
Fluss Is Now Open Source
Fluss, a real-time streaming storage system for data analytics, is now op...
Announcing Ververica Platform: Self-Managed 2.14
Discover the latest release of Ververica Platform Self-Managed v.2.14, in...
Real-Time Insights for Airlines with Complex Event Processing
Discover how Complex Event Processing (CEP) and Dynamic CEP help optimize...