[FLINK-37375][checkpointing] Checkpoint supports the Operator to cust… #26330

Open
wants to merge 1 commit into
base: master

@hejufang hejufang commented Mar 20, 2025

What is the purpose of the change

Allow an operator to define a custom asynchronous operation that runs as part of a checkpoint, which minimizes blocking of the main thread and improves task throughput.

Brief change log

Add an asyncOperate interface for checkpointing. Users can implement this interface to define custom asynchronous processing logic.
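As a rough illustration of the idea (not the code in this PR), the sketch below runs a user-defined hook on a thread pool so that the calling thread is not blocked. `AsyncOperateFunction`, `SnapshotContext`, and `triggerAsync` are hypothetical stand-ins invented for this example; the actual interface and signatures in the PR may differ.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class AsyncOperateSketch {

    /** Hypothetical stand-in for the snapshot context handed to the hook. */
    static final class SnapshotContext {
        final long checkpointId;

        SnapshotContext(long checkpointId) {
            this.checkpointId = checkpointId;
        }
    }

    /** Hypothetical shape of the user-facing hook (assumption, not the PR's API). */
    interface AsyncOperateFunction {
        void asyncOperate(SnapshotContext context) throws Exception;
    }

    /**
     * Submits the hook to the given pool and returns immediately.
     * The future completes with the checkpoint id once the hook has finished.
     */
    static CompletableFuture<Long> triggerAsync(
            AsyncOperateFunction fn, SnapshotContext ctx, ExecutorService pool) {
        return CompletableFuture.supplyAsync(
                () -> {
                    try {
                        fn.asyncOperate(ctx);
                        return ctx.checkpointId;
                    } catch (Exception e) {
                        // Surface checked exceptions through the future.
                        throw new CompletionException(e);
                    }
                },
                pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<Long> done =
                    triggerAsync(
                            ctx -> System.out.println("flushing buffers for " + ctx.checkpointId),
                            new SnapshotContext(1L),
                            pool);
            // The caller is free to continue while the hook runs on the pool.
            System.out.println("main thread keeps processing records");
            System.out.println("async operation finished for checkpoint " + done.join());
        } finally {
            pool.shutdown();
        }
    }
}
```

The point of the sketch is only the threading model: the hook executes on a pool thread, and the caller observes completion or failure through the returned future.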

Verifying this change

This change added tests and can be verified as follows:
CheckpointWithAsyncOperateITCase

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Collaborator

flinkbot commented Mar 20, 2025

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build


/**
* This method is called when a snapshot for a checkpoint is requested. Execution of this method
* does not block the main thread, is performed in an asynchronous thread pool.
Contributor

nit: ", is" seems to be missing a word. Maybe ", as it is"?

* This method is called when a snapshot for a checkpoint is requested. Execution of this method
* does not block the main thread, is performed in an asynchronous thread pool.
*
* @param context the context for drawing a snapshot of the operator
Contributor

I am not sure what "drawing" means here. Maybe simplify to "the context from which a snapshot of the operator is created".

* does not block the main thread, is performed in an asynchronous thread pool.
*
* @param context the context for drawing a snapshot of the operator
* @throws Exception Thrown and task will be failed, if state could not be created ot restored.
Contributor

@davidradl davidradl Mar 20, 2025

nit: maybe rephrase to be a little clearer.

  • "If you don't want to task fail, can throw a RetriableAsyncOperateException": I am not sure who "you" refers to here.

Would the following be correct:

A RetriableAsyncOperateException thrown by this method indicates that the checkpoint has failed but the asynchronous checkpoint operation can be retried. Any other exception indicates that the checkpoint should not be retried, so the task will fail.

Furthermore, I think it would be better not to declare `throws Exception` on these methods. All the specific exception types that are likely to be thrown should be declared explicitly on these methods so that the contract is explicit. Then we can document the behavior appropriately for each exception type.
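To make the suggested contract concrete, here is a minimal sketch under stated assumptions. `RetriableAsyncOperateException` is the name used in this PR, but the class below is only a stand-in. `AsyncOperateFailedException`, `AsyncOperation`, and `runWithRetries` are hypothetical names invented for illustration, and how the PR actually schedules retries is not shown here.

```java
final class AsyncOperateRetrySketch {

    /** Stand-in for the exception named in this PR; the real class may differ. */
    static final class RetriableAsyncOperateException extends Exception {
        RetriableAsyncOperateException(String message) {
            super(message);
        }
    }

    /** Hypothetical non-retriable failure: the task should fail. */
    static final class AsyncOperateFailedException extends Exception {
        AsyncOperateFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical hook that declares its exception types explicitly. */
    interface AsyncOperation {
        void asyncOperate(long checkpointId)
                throws RetriableAsyncOperateException, AsyncOperateFailedException;
    }

    /**
     * Runs the operation, retrying only on the retriable exception.
     * Returns the number of attempts used; any non-retriable failure propagates at once.
     */
    static int runWithRetries(AsyncOperation op, long checkpointId, int maxAttempts)
            throws AsyncOperateFailedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RetriableAsyncOperateException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.asyncOperate(checkpointId);
                return attempt;
            } catch (RetriableAsyncOperateException e) {
                last = e; // retriable: try again if attempts remain
            }
        }
        throw new AsyncOperateFailedException(
                "retries exhausted for checkpoint " + checkpointId, last);
    }
}
```

With the exception types spelled out in the `throws` clause, the Javadoc can describe each case separately instead of a single catch-all `@throws Exception`.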

3 participants