Troubleshooting Apache Flink with Byteman

Introduction

What do you do when you need more detail about some Apache Flink application logic at runtime, but that code path has no logging? One option is to modify the Flink source code or the application code, then recompile and redeploy it, which is time-consuming and error-prone. A quicker and more straightforward approach is to use Byteman, which can inject Java code into a running JVM and surface the runtime details you need.

What is Byteman

Byteman is a tool that makes it easy to trace, monitor, and test Java applications and JDK runtime code behavior. It can inject Java code into the application methods or into Java runtime methods without the need for you to recompile, repackage or even redeploy your application. The injected code can access any of your data and call any application methods, including the ones that are private.

To fully unleash the power of Byteman, you write rules in a simple scripting language based on a formalism called Event Condition Action (ECA) to specify where, when, and how the original Java code should be transformed. The event of a rule specifies a trigger point: the class, method, and location where you want code to be injected. When execution reaches the trigger point, the rule's condition, a Java boolean expression, is evaluated. The Java expression (or sequence of expressions) in the rule's action is executed only when the condition is true.
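
To make the three parts concrete, here is a minimal sketch of a rule. The class com.example.PaymentService and its charge(long) method are hypothetical names used only for illustration; traceln is one of Byteman's built-in helper methods and writes to standard output.

```
# Hypothetical target, for illustration only
RULE trace large charges

# Event: where to inject (entry of PaymentService.charge(long))
CLASS com.example.PaymentService
METHOD charge(long)
AT ENTRY

# Condition: a Java boolean expression; $1 is the first method parameter
IF $1 > 100000

# Action: runs only when the condition evaluates to true
DO traceln("charge() called with large amount: " + $1)

ENDRULE
```

With this rule loaded, calls to charge with a smaller amount pass through untouched, because the action is skipped whenever the condition is false.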

In the next section, I will use an example to show how to leverage Byteman to retrieve more details of the underlying logic within a Flink application.

Apache Flink Troubleshooting Case Study

Purpose

Checkpointing is the fault tolerance mechanism in the Apache Flink framework. When using S3 as the checkpointing destination, Flink usually relies on the Hadoop or Presto libraries for the underlying communication. However, it can be challenging to troubleshoot issues in that code path because these third-party libraries don’t always contain sufficient logging. In this example, I’ll demonstrate some of Byteman’s capabilities by injecting logging code into a Flink application running on Ververica Platform.

Preparation

Download the latest version of Byteman (4.0.15 at the time of writing) from the official website. After extracting the archive, you can find the required byteman.jar in the byteman-download-4.0.15/lib directory.

Note that the byteman-install.jar and byteman-submit.jar files in the same directory are not sufficient for this use case.

Rules

To achieve the goal, let’s write some rules for Byteman in a plain text file (lines starting with ‘#’ are comments):


# File name: rules_v1.btm
# Start of a rule (naming the rule)
RULE rule_example_1

# Target class
CLASS ^org.apache.flink.fs.s3.common.FlinkS3FileSystem

# Target method (i.e. the constructor method)
METHOD <init>

# Injection position in the method (e.g. ENTRY/EXIT/LINE number/...)
AT ENTRY

# Bind a parameter for logging (same as a local variable)
BIND myLOG:org.slf4j.Logger = org.slf4j.LoggerFactory.getLogger($0.getClass());

# Trigger condition (no need in this case, so I put 'true')
IF true

# Actions to take when the rule is triggered (log the total number of constructor parameters and the values of the first and fifth parameters)
DO myLOG.info("Hello, FlinkS3FileSystem! -- Byteman");
   myLOG.info("Total number of parameters: " + $#);
   myLOG.info("#1 hadoopS3FileSystem: " + $1);
   myLOG.info("#5 S3AccessHelper: " + $5);

# End of rule
ENDRULE

# ----------------------------------
# Another rule. (A single byteman rule file can contain multiple rule definitions.)

RULE rule_example_2

CLASS ^com.facebook.presto.hive.s3.PrestoS3FileSystem

METHOD initialize

AT EXIT

BIND myLOG:org.slf4j.Logger = org.slf4j.LoggerFactory.getLogger($0.getClass());

IF true

DO myLOG.info("Hello, PrestoS3FileSystem! -- Byteman");
   myLOG.info("AmazonS3: [" + $0.s3 + "]");
   myLOG.info("TransferManagerConfiguration: [" + $0.transferConfig + "]");

ENDRULE
# End of file rules_v1.btm

The full explanation of the rule language can be found in the Byteman Programmer's Guide.
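
The same rule file could also report failures. As a sketch only, the rule below reuses the PrestoS3FileSystem.initialize target from rule_example_2 but fires when the method exits by throwing an exception. It has not been run against this Deployment. AT EXCEPTION EXIT and the $^ variable (the exception in flight) are part of the Byteman rule language; the rule name rule_example_3 is just a label chosen here.

```
RULE rule_example_3

CLASS ^com.facebook.presto.hive.s3.PrestoS3FileSystem

METHOD initialize

# Fire when initialize() exits by throwing an exception
AT EXCEPTION EXIT

BIND myLOG:org.slf4j.Logger = org.slf4j.LoggerFactory.getLogger($0.getClass());

IF true

# $^ is the exception being thrown out of the method
DO myLOG.error("PrestoS3FileSystem.initialize failed: " + $^ + " -- Byteman");

ENDRULE
```

Because the message keeps the same ‘Byteman’ marker, it can be found in the logs the same way as the output of the other two rules.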

Ververica Platform Configuration

This demo uses Ververica Platform 2.5.0 with Flink 1.13.1. The Flink application is Top Speed Windowing from the Ververica Platform documentation.

First, upload both byteman.jar and the rules file to Ververica Platform as two artifacts.

Next, add both files to the Additional Dependencies of the Deployment and configure env.java.opts for the Apache Flink application.

Note that the references to the Byteman files in env.java.opts must be absolute paths. Otherwise, the agent will fail to load. In a Ververica Platform (VVP) Deployment, the default local path for additional dependencies is /flink/usrlib, so the complete configuration looks like this:

env.java.opts: >-
  -javaagent:/flink/usrlib/byteman.jar=script:/flink/usrlib/rules_v1.btm,boot:/flink/usrlib/byteman.jar

Results

After you apply the above changes, the Deployment restarts automatically. To check the results, click the Flink UI button at the top of the Deployment page and then click the Task Managers tab on the left of the Flink page.

Open a TaskManager (TM) page and click its Logs tab to see the TM log. Searching for ‘Byteman’ in the log view now shows the injected log messages.

Summary

Byteman is a powerful tool that can help diagnose many issues in Java applications. To get up to speed quickly, please refer to this quick tutorial. Byteman can save you many hours when troubleshooting complex issues in your application, especially when other troubleshooting methods have failed. This article provided only a brief introduction, and the example above reveals only a tiny fraction of Byteman's capabilities. For more information, please check the Byteman website. For additional troubleshooting and debugging tips, check our Ververica Troubleshooting & Operations training and sign up for the next available training date on our website.
