Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade Iceberg 1.7.0 #442

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

singhpk234
Copy link

@singhpk234 singhpk234 commented Nov 9, 2024

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context. List any dependencies that are required for this change.

Upgrade Apache Iceberg to 1.7.0
Additionally :

  • Removes PolarisS3FileIOFactory as it's not required anymore (if it's a public class, we might wanna see this deprecation)
  • Add Endpoint discovery support as newly added feature in 1.7
  • Fix Nested Namespace failure

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Documentation update
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Existing Suite

Test Configuration:

  • Hardware:
  • Toolchain:
  • SDK:

Checklist:

Please delete options that are not relevant.

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • If adding new functionality, I have discussed my implementation with the community using the linked GitHub issue

Comment on lines -2058 to -2059
propertiesWithS3CustomizedClientFactory.put(
S3FileIOProperties.CLIENT_FACTORY, PolarisS3FileIOClientFactory.class.getName());
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer required because of this : https://github.com/apache/iceberg/pull/11259/files

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might have to mark this deprecated first since it's a public class

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was only ever used for this purpose, so we're probably fine

@flyrain
Copy link
Contributor

flyrain commented Nov 9, 2024

LGTM with a minor style issue:

> The following files had format violations:
      src/main/java/org/apache/polaris/service/catalog/BasePolarisCatalog.java
          @@ -47,7 +47,6 @@
           import·org.apache.iceberg.TableMetadata;
           import·org.apache.iceberg.TableMetadataParser;
           import·org.apache.iceberg.TableOperations;
          -import·org.apache.iceberg.aws.s3.S3FileIOProperties;
           import·org.apache.iceberg.catalog.Namespace;
           import·org.apache.iceberg.catalog.SupportsNamespaces;
           import·org.apache.iceberg.catalog.TableIdentifier;
  Run './gradlew :polaris-service:spotlessApply' to fix these violations.

@@ -277,6 +277,7 @@ commons-collections:commons-collections
commons-io:commons-io
commons-logging:commons-logging
commons-net:commons-net
dev.failsafe:failsafe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is it a new transitive dependency brought by iceberg 1.7?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, it's added as part of AWS S3InputStream retry handling via : https://github.com/apache/iceberg/pull/10433/files

@singhpk234
Copy link
Author

checking test failure

@singhpk234
Copy link
Author

on debugging it, looks like we need to handle this change cleanly [1], since polaris supports more endpoint than default impl,
let me fix this !

[1] apache/iceberg@a2b8008#diff-86450612dbe323d6d06cbc3846aa1913f042eaedadc0ca027c36bfbe08d3a46c

@@ -341,7 +341,7 @@ public void testIcebergListNamespacesNestedNotFound() throws IOException {
sessionCatalog.listNamespaces(
sessionContext, Namespace.of("top_level", "whoops")))
.isInstanceOf(NoSuchNamespaceException.class)
.hasMessage("Namespace does not exist: top_level.whoops");
.hasMessage("Namespace does not exist: top_level%1Fwhoops");
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

found this change : apache/iceberg@5fc1413 looks like we changes the seperator,
pending to make it configurable apache/iceberg#10877

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no more intent to make the separator configurable. At least, that's the latest I know. There was a big discussion around this topic in August this year.
IMHO, making the separator configurable makes things overly complicated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the response in the error message now? I think the dot in the error is much more user friendly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting, as Iceberg still uses dot to join different level of namespaces, https://github.com/apache/iceberg/blob/09634857e4a1333f5dc742d1dca3921e9a9f62dd/api/src/main/java/org/apache/iceberg/catalog/Namespace.java#L97-L97.
In this case, Namespace.of("top_level", "whoops") should be referred to top_level.whoops, unless the rest util somehow squash two levels together with the "%1F".
@singhpk234 , could you take a look?

Copy link
Author

@singhpk234 singhpk234 Nov 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops looks like it was a style miss : https://github.com/apache/polaris/actions/runs/11859822454/job/33055322598
Here are the local results :

Screenshot 2024-11-16 at 5 47 43 PM

error : org.apache.iceberg.exceptions.NoSuchNamespaceException: Namespace does not exist: ns1?ns1a

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the revert PR (apache/iceberg#11574) is targeting Iceberg 1.7.1. Should we wait for the 1.7.1 release to avoid unnecessary back-and-forth?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I think it's fair to wait as we not are chasing a release time in Polaris and 1.7.1 should be fast, I saw an RC is cut already apache/iceberg#11593

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see, when the pr get merged and we have nightly artifact published, I can try reverting my change to correct the test and run our polaris suite with that ? wdyt

Copy link
Author

@singhpk234 singhpk234 Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can confirm its passing now, I have pointed to 1.8.0 artifact (can't find 1.7.1 in nightly) from nigthly, will change this 1.7.1 when it get released

@MonkeyCanCode
Copy link
Contributor

@flyrain @singhpk234 should we update iceberg 1.7.0 dependencies for "getting start" examples as well? Most of them are using 1.5.2 iceberg runtime. I can take this if not already done.

@snazy
Copy link
Member

snazy commented Nov 11, 2024

Note: Today I realized that there are two "special" changes in Iceberg/Java 1.7.0:

  1. Object-storage layout - the generated paths have changed. This is relevant when IAM policies for S3 + GCS are generated. IIRC, IAM policies generated by Polaris are not yet adopted for object-storage layout (write.object-storage.enabled=true + data path). Before 1.7.0, the paths were generated like s3://bucket/<random-ish-base64>/..., since 1.7.0 it's s3://bucket/<random-ish-binary-part-1>/<random-ish-binary-part-2>/<random-ish-binary-part-3>/<random-ish-binary-part-4>/...)
  2. Iceberg/Java 1.7.0 refuses to submit some requests like the multi-table-commit, if the service does not return ConfigResponse.endpoints() or that list is empty.

@flyrain
Copy link
Contributor

flyrain commented Nov 11, 2024

Object-storage layout - the generated paths have changed. This is relevant when IAM policies for S3 + GCS are generated. IIRC, IAM policies generated by Polaris are not yet adopted for object-storage layout (write.object-storage.enabled=true + data path). Before 1.7.0, the paths were generated like s3://bucket//..., since 1.7.0 it's s3://bucket/////...)

This is due to the new Object Storage v3 changes introduced in the 1.7 release. However, since Polaris doesn’t currently support the hash prefix before the table path, this update should not impact functionality as far as I understand. Let me know if I’m missing any nuances here!

Iceberg/Java 1.7.0 refuses to submit some requests like the multi-table-commit, if the service does not return ConfigResponse.endpoints() or that list is empty.

#383 will resolve this issue, so let's prioritize it. In the short term, it should be manageable since most REST clients haven’t yet adopted the latest 1.7 release.

@snazy
Copy link
Member

snazy commented Nov 11, 2024

Object-storage layout - the generated paths have changed. This is relevant when IAM policies for S3 + GCS are generated. IIRC, IAM policies generated by Polaris are not yet adopted for object-storage layout (write.object-storage.enabled=true + data path). Before 1.7.0, the paths were generated like s3://bucket//..., since 1.7.0 it's s3://bucket/////...)

This is due to the new Object Storage v3 changes introduced in the 1.7 release. However, since Polaris doesn’t currently support the hash prefix before the table path, this update should not impact functionality as far as I understand. Let me know if I’m missing any nuances here!

For Polaris it's the amount of combinations that need to be supported: no object-storage prefix, the old and the new one. For GCS it's easier, because Google accepts regular expressions - S3 however does not.

Iceberg/Java 1.7.0 refuses to submit some requests like the multi-table-commit, if the service does not return ConfigResponse.endpoints() or that list is empty.

#383 will resolve this issue, so let's prioritize it. In the short term, it should be manageable since most REST clients haven’t yet adopted the latest 1.7 release.

@flyrain
Copy link
Contributor

flyrain commented Nov 11, 2024

@flyrain @singhpk234 should we update iceberg 1.7.0 dependencies for "getting start" examples as well? Most of them are using 1.5.2 iceberg runtime. I can take this if not already done.

+1, @MonkeyCanCode! Thanks for picking it up as a follow-up.

@MonkeyCanCode
Copy link
Contributor

MonkeyCanCode commented Nov 11, 2024

@flyrain @singhpk234 should we update iceberg 1.7.0 dependencies for "getting start" examples as well? Most of them are using 1.5.2 iceberg runtime. I can take this if not already done.

+1, @MonkeyCanCode! Thanks for picking it up as a follow-up.

Anytime. Will PR this once current PR is merged. Then i can build it locally for both client (spark) and server (polaris).

Copy link
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Thanks @singhpk234 for working on it. Hi @dennishuo @collado-mike @eric-maynard, could you take a look as well?

@flyrain
Copy link
Contributor

flyrain commented Nov 13, 2024

I will merge this by EOD if there is no objection.

@MonkeyCanCode MonkeyCanCode mentioned this pull request Nov 14, 2024
9 tasks
@MonkeyCanCode
Copy link
Contributor

Here is a PR for the 2 missing changes: #447

@@ -19,7 +19,7 @@

[versions]
hadoop = "3.4.0"
iceberg = "1.6.1"
iceberg = "1.8.0-SNAPSHOT"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to 1.7.1 when released please ref this discussion thread #442 (comment)

@@ -69,6 +69,7 @@ dependencyResolutionManagement {
repositories {
mavenCentral()
gradlePluginPortal()
maven(url = "https://repository.apache.org/content/repositories/snapshots")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this when 1.7.1 is release please ref this thread : #442 (comment)

@RussellSpitzer
Copy link
Member

Took this branch and ran with 1.7.1-RC1 of Iceberg. All tests passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants