Troubleshooting Apache Flink with Byteman
Introduction
What would you do if you needed to see more details of some Apache Flink application logic at runtime, but there was no logging in that code path? One option is to modify the Flink source code or the application code, then recompile and redeploy it, which is time-consuming and error-prone. A quicker and more straightforward approach is to use Byteman. It can inject Java code into a running JVM and retrieve the runtime details you need.
What is Byteman
Byteman is a tool that makes it easy to trace, monitor, and test Java applications and JDK runtime code behavior. It can inject Java code into the application methods or into Java runtime methods without the need for you to recompile, repackage or even redeploy your application. The injected code can access any of your data and call any application methods, including the ones that are private.
To fully unleash the power of Byteman, you can use a simple scripting language based on a formalism called Event Condition Action (ECA) rule to specify where, when, and how the original Java code should be transformed. A rule specifies a trigger point and a location where you want code to be injected. When the execution reaches the trigger point, the rule's condition, a Java boolean expression, is evaluated. The Java expression (or sequence of expressions) in the rule's action is executed only when the condition is true.
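As a minimal sketch of how the three ECA parts map onto a rule, consider the script below. The class com.example.BatchWriter, its write(int) method, and the threshold of 1000 are hypothetical and exist only for illustration; traceln is one of Byteman's built-in helper methods and prints to standard output.

```text
# Event: where and when the rule fires (hypothetical class and method)
RULE trace_large_batches
CLASS com.example.BatchWriter
METHOD write(int)
AT ENTRY
# Condition: a Java boolean expression; $1 is the first method argument
IF $1 > 1000
# Action: executed only when the condition evaluates to true
DO traceln("write() called with batch size " + $1)
ENDRULE
```

With this rule loaded, calls to write() with an argument of 1000 or less would pass through untouched, while larger values would produce one trace line per call.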
In the next section, I will use an example to show how to leverage Byteman to retrieve more details of the underlying logic within a Flink application.
Apache Flink Troubleshooting Case Study
Purpose
Checkpointing is the fault tolerance mechanism in the Apache Flink framework. When using S3 as the checkpointing destination, Flink usually leverages the Hadoop or Presto libraries for the underlying communication. However, it can sometimes be challenging to troubleshoot issues in that code path because the third-party libraries don’t always contain sufficient logging. In this example, I’ll demonstrate some of Byteman’s capabilities by injecting logging code into a Flink application running on Ververica Platform.
Preparation
Download the latest version of Byteman (4.0.15 at the time of writing) from the official website. After extracting the archive, you can find the required byteman.jar in the byteman-download-4.0.15/lib directory.
Note: the byteman-install.jar and byteman-submit.jar files in the same directory are not sufficient for this use case.
Rules
To achieve this goal, let’s write some rules for Byteman in a plain text file (lines starting with ‘#’ are comments):
# File name: rules_v1.btm
# Start of a rule (naming the rule)
RULE rule_example_1
# Target class
CLASS ^org.apache.flink.fs.s3.common.FlinkS3FileSystem
# Target method (i.e. the constructor method)
METHOD <init>
# Injection position in the method (e.g. ENTRY/EXIT/LINE number/...)
AT ENTRY
# Bind a parameter for logging (same as a local variable)
BIND myLOG:org.slf4j.Logger = org.slf4j.LoggerFactory.getLogger($0.getClass());
# Trigger condition (not needed in this case, so it is simply 'true')
IF true
# Actions to take when the rule is triggered (log the total number of constructor parameters together with the values of the first and fifth parameters)
DO myLOG.info("Hello, FlinkS3FileSystem! -- Byteman");
myLOG.info("Total number of parameters: " + $#);
myLOG.info("#1 hadoopS3FileSystem: " + $1);
myLOG.info("#5 S3AccessHelper: " + $5);
# End of rule
ENDRULE
# ----------------------------------
# Another rule. (A single byteman rule file can contain multiple rule definitions.)
RULE rule_example_2
CLASS ^com.facebook.presto.hive.s3.PrestoS3FileSystem
METHOD initialize
AT EXIT
BIND myLOG:org.slf4j.Logger = org.slf4j.LoggerFactory.getLogger($0.getClass());
IF true
DO myLOG.info("Hello, PrestoS3FileSystem! -- Byteman");
myLOG.info("AmazonS3: [" + $0.s3 + "]");
myLOG.info("TransferManagerConfiguration: [" + $0.transferConfig + "]");
ENDRULE
# End of file rules_v1.btm
The full explanation of the rule language can be found in the Byteman Programmer's Guide.
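The rules above always fire because their condition is 'true'. As a hedged sketch of a more selective rule, the example below logs only when a particular condition holds and also prints the call stack, which can help identify which code path triggered the initialization. It assumes that PrestoS3FileSystem.initialize receives the file system URI as its first parameter (as in Hadoop's FileSystem.initialize(URI, Configuration)); the "s3" scheme check is an illustrative assumption and may need adjusting to the scheme your deployment actually uses. Note that the built-in traceln and traceStack helpers write to standard output, so the output would be expected in the TaskManager's stdout rather than in its log file.

```text
# Hypothetical additional rule, not part of rules_v1.btm
RULE rule_example_conditional
CLASS ^com.facebook.presto.hive.s3.PrestoS3FileSystem
METHOD initialize
AT ENTRY
# Fire only for URIs with the (assumed) "s3" scheme; $1 is the first parameter
IF $1 != null && "s3".equals($1.getScheme())
DO traceln("PrestoS3FileSystem.initialize called for: " + $1 + " -- Byteman");
traceStack("Called from: ")
ENDRULE
```

Narrowing the condition this way keeps the extra output small on busy code paths, which matters when the rule targets a method that is invoked frequently.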
Ververica Platform Configuration
The version of Ververica Platform in this demo is 2.5.0 and the corresponding Flink version is 1.13.1. The Flink application in the demo is Top Speed Windowing from the Ververica Platform documentation.
First, we need to upload both byteman.jar and the rules file to Ververica Platform as two artifacts.
Next, add both files to the Additional Dependencies and configure env.java.opts for the Apache Flink application:
env.java.opts: >-
-javaagent:/flink/usrlib/byteman.jar=script:/flink/usrlib/rules_v1.btm,boot:/flink/usrlib/byteman.jar
Results
After implementing the above changes, the Deployment will automatically restart. To check the results, click the Flink UI button at the top of the Deployment page and then click the Task Manager tab on the left of the Flink page.
Open a TaskManager page and click the Logs tab to see the TaskManager logs. Searching for ‘Byteman’ in the log view now shows the customized logging information.
Summary
Byteman is a powerful tool that can help diagnose many Java application issues. If you want to get up to speed with Byteman quickly, please refer to this quick tutorial. It can save you many hours when troubleshooting complex issues in your application, especially when other troubleshooting methods have failed. This article provided only a brief introduction to Byteman, and the example above reveals only a tiny fraction of its capabilities. For more information, please check the Byteman website. For additional troubleshooting and debugging tips, make sure to check our Ververica Troubleshooting & Operations training and sign up for the next available training date on our website.