
[Bug] Job cannot recover from checkpoint/savepoint if parallelism is changed from 1 to 2 #4543

Open
Tan-JiaLiang opened this issue Nov 18, 2024 · 4 comments
Labels: bug

@Tan-JiaLiang (Contributor)

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

0.9.0

Compute Engine

Flink

Minimal reproduce step

  1. Start a job that writes to a Paimon append-only table with parallelism=1.
  2. Stop the job.
  3. Restore the job from the checkpoint and change its parallelism to 2.
  4. An error appears and the job cannot restore from the checkpoint.

What doesn't meet your expectations?

The job should restore from the checkpoint/savepoint even if I change the parallelism.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@Tan-JiaLiang added the bug label on Nov 18, 2024
@Tan-JiaLiang (Contributor, Author)

if (enableCompaction && isStreamingMode) {
    written =
            written.transform(
                            "Compact Coordinator: " + table.name(),
                            new EitherTypeInfo<>(
                                    new CommittableTypeInfo(),
                                    new CompactionTaskTypeInfo()),
                            new AppendBypassCoordinateOperatorFactory<>(table))
                    // the coordinator always runs with parallelism 1
                    .forceNonParallel()
                    .transform(
                            "Compact Worker: " + table.name(),
                            new CommittableTypeInfo(),
                            new AppendBypassCompactWorkerOperator(table, initialCommitUser))
                    // the worker follows the writer's parallelism
                    .setParallelism(written.getParallelism());
}

If the job parallelism is 1, the Writer operator and the Compact Coordinator operator get chained together. Since the parallelism of the Compact Coordinator operator is always 1, raising the job parallelism separates the Writer and the Compact Coordinator into different chains. Flink derives its auto-generated operator IDs from the job graph topology, including chaining, so after the change the IDs no longer match the ones in the snapshot and the state cannot be recovered.

We need to disable chaining between the writer operator and the compact coordinator operator.
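
A minimal, self-contained sketch of the idea, with generic Flink operators standing in for the Paimon writer and coordinator (the class and the map functions here are hypothetical placeholders, not the actual patch):

import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DisableChainingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stands in for the Writer operator; runs at the job parallelism.
        SingleOutputStreamOperator<String> written =
                env.fromElements("a", "b", "c").map(String::toUpperCase);

        // Stands in for the Compact Coordinator; always parallelism 1.
        written.map(String::trim)
                .forceNonParallel()
                // Break the chain explicitly: with parallelism 1 the writer and
                // the coordinator would otherwise chain, and raising the job
                // parallelism later would change the auto-generated operator IDs.
                .disableChaining()
                .print();

        env.execute("disable-chaining-sketch");
    }
}

In the real sink, the same .disableChaining() call would presumably slot in right after forceNonParallel() on the Compact Coordinator transform, so that the writer's operator ID stays stable regardless of the job parallelism.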

@Tan-JiaLiang (Contributor, Author)

if (options.get(CHANGELOG_PRECOMMIT_COMPACT)) {
    written =
            written.transform(
                            "Changelog Compact Coordinator",
                            new EitherTypeInfo<>(
                                    new CommittableTypeInfo(), new ChangelogTaskTypeInfo()),
                            new ChangelogCompactCoordinateOperator(table))
                    .forceNonParallel()
                    .transform(
                            "Changelog Compact Worker",
                            new CommittableTypeInfo(),
                            new ChangelogCompactWorkerOperator(table))
                    .setParallelism(written.getParallelism());
}

The same issue exists in org.apache.paimon.flink.sink.FlinkSink#doWrite.

@Tan-JiaLiang (Contributor, Author)

@JingsongLi WDYT? Should we add an option to control this? Like #3232.

@Tan-JiaLiang (Contributor, Author)

I think #4424 can solve this problem. But do we still need to disable chaining? Or should we just recommend that users add the 'sink.operator-uid.suffix' and 'source.operator-uid.suffix' options to their Flink jobs?
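
For the documentation route, a hedged sketch of what we might recommend; the option keys are the ones named above, and the helper below is hypothetical (it assumes the user applies dynamic options via Table#copy):

import java.util.HashMap;
import java.util.Map;

import org.apache.paimon.table.Table;

public final class StableUidOptions {

    // Hypothetical helper: returns a copy of the table whose Flink source and
    // sink operators get a fixed UID suffix, so state can still be mapped after
    // the job parallelism changes. The suffix value is chosen by the user.
    public static Table withStableUids(Table table, String suffix) {
        Map<String, String> dynamicOptions = new HashMap<>();
        dynamicOptions.put("source.operator-uid.suffix", suffix);
        dynamicOptions.put("sink.operator-uid.suffix", suffix);
        return table.copy(dynamicOptions);
    }
}

With explicit UIDs pinned this way, state mapping no longer depends on Flink's topology-derived operator IDs, so the chaining change caused by a different parallelism would no longer break recovery.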
