[FLINK-36061][iceberg] Add Iceberg Sink. #3904

Open · wants to merge 1 commit into base: master

Conversation

@lvyanquan (Contributor) commented Feb 5, 2025

Add Iceberg DataSink.
Notice:
This PR does not include the logic of automatically compacting small files.

Co-authored with @czy006, based on #3877.

@lvyanquan lvyanquan marked this pull request as draft February 5, 2025 01:40
@github-actions github-actions bot added the docs, mysql-cdc-connector, and e2e-tests labels Feb 7, 2025
@lvyanquan lvyanquan force-pushed the FLINK-36061 branch 2 times, most recently from 1d17ae3 to f0d2899 Compare February 8, 2025 11:58
@lvyanquan lvyanquan marked this pull request as ready for review February 10, 2025 01:45
@SML0127 (Contributor) left a comment:

Impressive and long-awaited work! I left some minor comments and questions. If possible, I'd like to contribute as well.

Comment on lines +96 to +130
if (dataFiles.isEmpty() && deleteFiles.isEmpty()) {
    LOGGER.info(String.format("Nothing to commit to table %s, skipping", table.name()));
} else {
    if (deleteFiles.isEmpty()) {
        AppendFiles append = table.newAppend();
        dataFiles.forEach(append::appendFile);
        append.commit();
    } else {
        RowDelta delta = table.newRowDelta();
        dataFiles.forEach(delta::addRows);
        deleteFiles.forEach(delta::addDeletes);
        delta.commit();
    }
Contributor:

Iceberg supports empty commits to remove old tmp files and clean some data from Flink state (apache/iceberg#6630). I'd like to know if a similar issue exists here.
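
A minimal sketch of the alternative being referenced, assuming it would slot into the commit method quoted above; checkpointId and the snapshot summary property name mirror Iceberg's own Flink committer and are assumptions here, not part of this PR:

if (dataFiles.isEmpty() && deleteFiles.isEmpty()) {
    // Commit an empty snapshot instead of skipping, so that state and temporary
    // files tied to earlier checkpoints can still be cleaned up (see apache/iceberg#6630).
    AppendFiles append = table.newAppend();
    // `checkpointId` is a hypothetical variable standing in for whatever
    // checkpoint/commit identifier is available at this point.
    append.set("flink.max-committed-checkpoint-id", String.valueOf(checkpointId));
    append.commit();
}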

Comment on lines +51 to +63
if (writeResult.dataFiles() != null) {
    for (DataFile dataFile : writeResult.dataFiles()) {
        addCount += dataFile.recordCount();
    }
}
long deleteCount = 0;
if (writeResult.deleteFiles() != null) {
    for (DeleteFile dataFile : writeResult.deleteFiles()) {
        deleteCount += dataFile.recordCount();
    }
}
Contributor:

It would be nice if we could also provide some metrics (e.g., numRecordsInDataFiles, numDataFiles).
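
A rough sketch of what that could look like, assuming the writer has access to a Flink MetricGroup (for example from the sink's init context); the class, method, and metric names are only illustrative and not part of this PR:

import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.MetricGroup;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.io.WriteResult;

class WriterMetricsSketch {
    // Sketch only: in a real writer the counters would be registered once,
    // e.g. in the constructor, rather than on every call.
    static void report(MetricGroup metricGroup, WriteResult writeResult) {
        Counter numDataFiles = metricGroup.counter("numDataFiles");
        Counter numRecordsInDataFiles = metricGroup.counter("numRecordsInDataFiles");
        if (writeResult.dataFiles() != null) {
            numDataFiles.inc(writeResult.dataFiles().length);
            for (DataFile dataFile : writeResult.dataFiles()) {
                numRecordsInDataFiles.inc(dataFile.recordCount());
            }
        }
    }
}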

Comment on lines +52 to +54
public static final ConfigOption<String> PARTITION_KEY =
        key("partition.key")
                .stringType()
                .defaultValue("")
                .withDescription(
                        "Partition keys for each partitioned table, allow setting multiple primary keys for multiTables. "
                                + "Tables are separated by ';', and partition keys are separated by ','. "
                                + "For example, we can set partition.key of two tables by 'testdb.table1:id1,id2;testdb.table2:name'.");
Contributor:

Currently, there is no issue since only identity is supported. However, I'm concerned that it may become more complex when other transforms, such as day, hour, bucket(n), and truncate(w), are supported in the future.

Contributor Author:

Sure, we should support more partition transform functions.
However, a thorny issue is how to configure these transformations: they take effect on individual fields, and we usually have many tables and fields, which makes them too complex to configure one by one. We need a more elegant configuration scheme.
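
For context, the transforms mentioned above correspond to Iceberg's PartitionSpec builder roughly as follows; the schema is hypothetical and only illustrates what a richer configuration would eventually need to express:

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

class PartitionSpecSketch {
    // Sketch only, not part of this PR.
    static PartitionSpec example() {
        Schema schema =
                new Schema(
                        Types.NestedField.required(1, "id", Types.LongType.get()),
                        Types.NestedField.required(2, "name", Types.StringType.get()),
                        Types.NestedField.required(3, "code", Types.StringType.get()),
                        Types.NestedField.required(4, "ts", Types.TimestampType.withZone()));
        return PartitionSpec.builderFor(schema)
                .identity("id")      // what partition.key can express today
                .day("ts")           // day(ts)
                .bucket("name", 16)  // bucket(16, name)
                .truncate("code", 4) // truncate(4, code)
                .build();
    }
}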

import static org.apache.flink.cdc.common.configuration.ConfigOptions.key;

/** Config options for {@link IcebergDataSink}. */
public class IcebergDataSinkOptions {
Contributor:

How about adding options for the catalog name and namespace?

Contributor Author:

I don't quite understand what kind of options you mean. Could you explain more about this?

Contributor:

I wonder what you think about receiving catalog-name as a separate option.
I also understand that the current code separates the namespace and table name based on a dot. This method works too, but I would like to ask for opinions on providing the namespace name as a separate option.

Suggested change
public class IcebergDataSinkOptions {
public class IcebergDataSinkOptions {
    public static final ConfigOption<String> CATALOG_NAME =
            key("catalog.name")
                    .stringType()
                    .noDefaultValue()
                    .withDescription("The name of iceberg catalog to use");

    public static final ConfigOption<String> NAMESPACE_NAME =
            key("namespace.name")
                    .stringType()
                    .noDefaultValue()
                    .withDescription("The name of iceberg namespace to use");

TableIdentifier.java

  public static TableIdentifier of(Namespace namespace, String name) {
    return new TableIdentifier(namespace, name);
  }

  public static TableIdentifier parse(String identifier) {
    Preconditions.checkArgument(identifier != null, "Cannot parse table identifier: null");
    Iterable<String> parts = DOT.split(identifier);
    return TableIdentifier.of(Iterables.toArray(parts, String.class));
  }
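
As a concrete illustration of the dot-based splitting in the quoted Iceberg code (a sketch; the table name is made up):

import org.apache.iceberg.catalog.TableIdentifier;

// Everything before the last dot becomes the (possibly multi-level) namespace,
// the last part becomes the table name:
// id.namespace() -> "testdb", id.name() -> "table1".
TableIdentifier id = TableIdentifier.parse("testdb.table1");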

@lvyanquan lvyanquan force-pushed the FLINK-36061 branch 2 times, most recently from 350e437 to 7c5033c Compare March 18, 2025 12:51
@github-actions github-actions bot removed the docs and e2e-tests labels Mar 18, 2025
@github-actions github-actions bot added the docs and e2e-tests labels Mar 19, 2025
@lvyanquan lvyanquan marked this pull request as draft March 24, 2025 06:40
@lvyanquan lvyanquan force-pushed the FLINK-36061 branch 3 times, most recently from d654b7a to acb0002 Compare March 26, 2025 01:25
@github-actions github-actions bot removed the docs label Mar 26, 2025
@lvyanquan lvyanquan marked this pull request as ready for review March 26, 2025 01:26
@github-actions github-actions bot added the docs label Mar 26, 2025
@lvyanquan (Contributor Author):

Hi @leonardBang, could you help review this?

@leonardBang leonardBang self-requested a review March 26, 2025 01:50
@yuxiqian (Contributor):

Hi @lvyanquan, thanks for your contribution! Could you please rebase this PR onto the latest master when available?

The code style verifier has been updated to enforce the JUnit 5 + AssertJ frameworks, and these classes might need to be migrated:

  • JUnit 4 style test annotations should be changed to JUnit 5 equivalents

    • org.junit.Test => org.junit.jupiter.api.Test
    • @Before, @BeforeClass => @BeforeEach, @BeforeAll
    • @After, @AfterClass => @AfterEach, @AfterAll
  • JUnit Assertions / Hamcrest Assertions are not allowed, including:

    • org.junit.Assert
    • org.junit.jupiter.api.Assertions
    • org.hamcrest.*

org.assertj.core.api.Assertions should be used instead.

Run mvn verify -DskipTests locally to check if all these requirements have been satisfied.
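
For illustration (not code from this PR), a test written against these requirements would look roughly like this; the class name and assertion are hypothetical:

import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

class MigrationStyleExampleTest {
    // Hypothetical example, only to show the JUnit 5 + AssertJ style above.
    private int value;

    @BeforeEach
    void setUp() {
        value = 42;
    }

    @Test
    void valueIsInitialized() {
        // AssertJ assertions replace org.junit.Assert / Assertions / Hamcrest.
        assertThat(value).isEqualTo(42);
    }
}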

+ "\n"
+ "pipeline:\n"
+ " schema.change.behavior: evolve\n"
+ " parallelism: 1",
Contributor:

Please add a test with larger parallelism.

if (commitTimes >= compactionOptions.getCommitInterval()
&& !compactedTables.contains(tableId)) {
if (throwable != null) {
throw new RuntimeException(throwable);
Contributor:

Do we need to move this if statement to the start of the processElement method, i.e., out of the if (element.getValue() instanceof CommittableWithLineage) block?


import static org.apache.flink.cdc.common.types.DataTypeChecks.getFieldCount;

/** Util class for {@link IcebergDataSink}. */
Contributor:

Suggested change
/** Util class for {@link IcebergDataSink}. */
/** Util class for types in {@link IcebergDataSink}. */

public class IcebergTypeUtils {

    /** Convert column from CDC framework to Iceberg framework. */
    public static Types.NestedField convertCDCColumnToIcebergField(
Contributor:

Suggested change
public static Types.NestedField convertCDCColumnToIcebergField(
public static Types.NestedField convertCdcColumnToIcebergField(

case TIMESTAMP_WITH_TIME_ZONE:
    return Types.TimestampType.withZone();
default:
    throw new IllegalArgumentException("Illegal type: " + type);
Contributor:

Suggested change
throw new IllegalArgumentException("Illegal type: " + type);
throw new IllegalArgumentException("Unsupported cdc type in iceberg: " + type);

};
break;
case TIMESTAMP_WITH_LOCAL_TIME_ZONE:
case TIMESTAMP_WITH_TIME_ZONE:
Contributor:

Is it right that TIMESTAMP_WITH_LOCAL_TIME_ZONE and TIMESTAMP_WITH_TIME_ZONE have the same code?

@Override
public Set<ConfigOption<?>> requiredOptions() {
    Set<ConfigOption<?>> options = new HashSet<>();
    return options;
Contributor:

return

Suggested change
return options;
return new HashSet<>();

this.catalogOptions = catalogOptions;
this.tableOptions = new HashMap<>();
this.partitionMaps = new HashMap<>();
this.enabledSchemaEvolutionTypes = getSupportedSchemaEvolutionTypes();
Contributor:

Suggested change
this.enabledSchemaEvolutionTypes = getSupportedSchemaEvolutionTypes();
this(catalogOptions, new HashMap<>(), new HashMap<>());

try {
    UpdateSchema updateSchema = table.updateSchema();
    for (AddColumnEvent.ColumnWithPosition columnWithPosition : event.getAddedColumns()) {
        Column addColumn = columnWithPosition.getAddColumn();
Contributor:

Suggested change
Column addColumn = columnWithPosition.getAddColumn();
Column addColumn = columnWithPosition.getAddColumn();
String columnName = addColumn.getName();
LogicalType logicalType =
        FlinkSchemaUtil.convert(
                DataTypeUtils.toFlinkDataType(addColumn.getType())
                        .getLogicalType());

Use columnName and logicalType in the following code.

public void close() throws Exception {
    if (schemaMap != null) {
        schemaMap.clear();
        schemaMap = null;
Contributor:

Will IcebergWriter be reused?

Labels: docs, e2e-tests

4 participants