
Add regtests for Spark client to test built jars #1402

Merged

merged 25 commits into apache:main on Apr 24, 2025

Conversation

@gh-yzou (Contributor) commented Apr 18, 2025:

This PR does the following:

  1. Fixes the jar job for the Spark client by excluding problematic jars.
  2. Adds regtests for the Spark client to test Spark usage with the built jars. They follow how polaris/regtests works today, supporting both Docker and local runs; the Docker run is also what CI uses (see the sketch below).
  3. Adds a new GitHub workflow to run the regtests for the Spark client.
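
For reference, a rough sketch of the two ways to run these regtests, pieced together from the commands referenced elsewhere in this PR (treat it as an outline rather than the authoritative instructions):

```shell
# Local run (sketch): build the Spark client jars, start a Polaris server on
# localhost:8181 in a separate terminal, then run the regtest script.
./gradlew build
./gradlew run    # run this in a separate terminal; it serves Polaris on localhost:8181
env POLARIS_HOST=localhost ./plugins/spark/v3.5/regtests/run.sh

# Docker run (sketch): this is also what the new CI workflow does.
docker compose -f plugins/spark/v3.5/regtests/docker-compose.yml up --build --exit-code-from regtest
```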

flyrain previously approved these changes Apr 22, 2025

@flyrain (Contributor) left a comment:

LGTM overall. Left questions and minor comments.

AWS_ACCESS_KEY_ID: ${{secrets.AWS_ACCESS_KEY_ID}}
AWS_SECRET_ACCESS_KEY: ${{secrets.AWS_SECRET_ACCESS_KEY}}
run: |
docker compose -f plugins/spark/v3.5/regtests/docker-compose.yml up --build --exit-code-from regtest
Contributor:

Nit: add an empty line at the end.

Contributor Author:

added

spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
Contributor:

Q: this is optional, should we just remove it?

Contributor Author:

removed

spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
spark.sql.catalog.polaris.uri=http://${POLARIS_HOST:-localhost}:8181/api/catalog
spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials
Contributor:

Maybe add a comment here noting that this is needed only for Iceberg tables right now.

Contributor Author:

added

spark.sql.catalog.polaris.uri=http://${POLARIS_HOST:-localhost}:8181/api/catalog
spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials
spark.sql.catalog.polaris.client.region=us-west-2
spark.sql.sources.useV1SourceList=''
Contributor:

Q: do we need this config?

Contributor Author:

Yes, this is needed because the Polaris Spark client relies on DataSourceV2 to load Spark tables; it makes sure no V1 source is used during provider resolution. Added a comment for this line as well.
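
As an illustration only, here is a minimal sketch of how these settings could be passed on the spark-sql command line. It assumes the packaged Polaris Spark client jars are already on the Spark classpath and a Polaris server is reachable on localhost:8181; the catalog name and URI simply follow the spark-defaults snippet above:

```shell
# Sketch: start spark-sql with the Polaris catalog configured and an empty V1
# source list, so providers are resolved through DataSourceV2 only.
spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog \
  --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
  --conf spark.sql.sources.useV1SourceList=""
```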

Collaborator:

Supporting only DSv2 has a very big performance concern (please see this blog: https://databeans-blogs.medium.com/delta-vs-iceberg-vs-hudi-reassessing-performance-cb8157005eb0); this is mostly due to Delta being on v1 and Iceberg on v2, and some Delta features don't work. Do we document that?

Also, why would we change native Parquet to v2 (not just Delta Parquet)?

Contributor Author:

I don't think this changes Parquet to v2; it just returns DeltaTableV2. For a Delta table, even if I return a V1Table, the Delta catalog will reconstruct a DeltaTableV2 out of it, and if I recall correctly, even with DeltaTableV2 some operations might still fall back to V1. Today we rely on DataSourceV2.loadTable to help construct the SparkTable; there is no such utility for DataSourceV1 yet. If we run into problems in the near future with other table types like CSV, we can add support for constructing V1Tables.

Collaborator:

> I don't think this changes Parquet to v2; it just returns DeltaTableV2.

It does; please see the default value of this config: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3882

> If I recall correctly, even with DeltaTableV2, some operations might still fall back to V1.

That's true. What do we do in that case, fail? Otherwise it can cause correctness issues.

> There is no such utility for DataSourceV1 yet; if we run into problems in the near future with other table types like CSV...

Sure, we just need to make sure we document that DSv1 is not supported and then let customers decide. Delta is trying to move to DSv2, but that's at least a year out.

Contributor Author:

Yes, I will add an extra limitations section in the Spark README to document the currently known limitations.

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Apr 22, 2025
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog
spark.sql.catalog.polaris.uri=http://${POLARIS_HOST:-localhost}:8181/api/catalog
# this configuration is used
Contributor:

It seems truncated.

Suggested change:
- # this configuration is used
+ # this configuration is used only for Iceberg tables now. Generic tables don't support credential vending yet.

Contributor Author:

Sorry, I think I actually forgot to complete the comment; updated.

flyrain previously approved these changes Apr 23, 2025
Comment on lines 78 to 88
Before you run the test, make sure you build the project to generate the Spark client jars.
```shell
./gradlew build
```

In this setup, a Polaris server must be running on localhost:8181 before running tests. The simplest
way to do this is to run the Polaris server in a separate terminal window:

```shell
./gradlew run
```
Contributor:

Maybe we could simplify it like the following; we could still keep the "Note: xxx" part.

Suggested change
Before you run the test, make sure you build the project to generate the Spark client jars.
```shell
./gradlew build
```
In this setup, a Polaris server must be running on localhost:8181 before running tests. The simplest
way to do this is to run the Polaris server in a separate terminal window:
```shell
./gradlew run
```
- `./gradlew build` -- build the Spark Client jars.
- `./gradlew run` -- start a Polaris server on localhost:8181.
- `env POLARIS_HOST=localhost ./plugins/spark/v3.5/regtests/run.sh` -- run regtests

Contributor Author:

Sounds good! Updated.

flyrain previously approved these changes Apr 23, 2025
Comment on lines 58 to 59
AWS_ACCESS_KEY_ID: ${{secrets.AWS_ACCESS_KEY_ID}}
AWS_SECRET_ACCESS_KEY: ${{secrets.AWS_SECRET_ACCESS_KEY}}
Collaborator:

[doubt] Why only AWS, not GCP / Azure?

Contributor Author:

No particular reason; I was mainly following how polaris/regtests works today. My main purpose for the regtests right now is to make sure Spark can start with the packaged jars and perform basic operations, so this should be enough for the current purpose. We will likely add more tests to cover more usage later.

# under the License.
#

FROM docker.io/apache/spark:3.5.5-java17
Collaborator:

We are running Java 21 in CI; why are we pulling Java 17 here then?

Contributor Author:

This mainly follows how the existing regtests work today: it pulls an existing Spark image from the published Spark Docker images here: https://hub.docker.com/r/apache/spark/tags. It seems only Java 11 and Java 17 images are available for Spark 3.5.5.

Collaborator:

Let's then make the CI use Java 17?

Contributor Author:

Does it matter what the CI uses, since we are starting everything in Docker? And the CI does more than just run the Spark container; it also starts the container with the server, which builds the Polaris project and starts the Polaris server, and the Polaris server will use Java 21. Furthermore, I would like to be consistent with how the current regtests CI looks today.

Collaborator:

Generally, the JVM used in CI is a stamp of approval that this works for that JVM version, both runtime- and compile-wise; building for a given JVM can require special flags per JVM version.
Saying this from my past experience upgrading Iceberg to JDK 17: apache/iceberg#7391

If we can only assert the Polaris Spark client with JDK 17, then we should have only JDK 17.

If we are saying the polaris-spark client will work on Java 21 (both build and runtime), then I think it is fine to have JDK 21 (though I would like to understand more about how this is guaranteed, when Spark doesn't do that?).

The previous tests were not building a Spark client of their own; they were using it, just asserting that Polaris is OK to work with 21, so 21 in CI made sense.

At least that's my read of it.

Contributor Author (@gh-yzou), Apr 23, 2025:

This regression test starts two containers: one for the Polaris server, and another one for Spark. The Polaris server build and run are both on Java 21; the CI builds the server image, so it needs Java 21 since that is what the server supports. The Spark image is downloaded directly from Docker Hub, and Docker manages the Java version and dependencies separately, so it doesn't need to match the Java version used by the CI.
So we have our Polaris server container running with Java 21 and our Spark container running with Java 17, which is actually the closest setup to how the server/client environment and jar release look for a user.

The Iceberg CI setup is different: it is mostly integration tests, which build and run directly in the CI environment, not in Docker containers. Our integration tests all run with Java 21 in another CI today, so we already have tests covering our Spark client working with both Java 17 and Java 21. We will need to add better CI support for running the integration tests on different Java versions later.

Contributor Author:

Discussed offline, will add CI to run regression tests on different Java versions, at least 11 and 17

Comment on lines 30 to 33
GOOGLE_APPLICATION_CREDENTIALS: $GOOGLE_APPLICATION_CREDENTIALS
AZURE_TENANT_ID: $AZURE_TENANT_ID
AZURE_CLIENT_ID: $AZURE_CLIENT_ID
AZURE_CLIENT_SECRET: $AZURE_CLIENT_SECRET
Collaborator:

Are we using these?

Contributor Author (@gh-yzou), Apr 23, 2025:

I don't think we are using them today; this mostly follows how regtests works today. We will use them later when I start adding cloud-specific test cases.

Collaborator (@singhpk234), Apr 23, 2025:

Let's add these when we add those tests then?

Contributor Author (@gh-yzou), Apr 23, 2025:

I removed those credentials for now; we will add them back when I add cloud-specific tests. I also removed the credentials folder.

insert into iceberg_tb values (123), (234), (111);
select * from iceberg_tb order by col1;

create table delta_tb1(col1 string) using delta location 'file:///tmp/spark_catalog/delta_tb1';
Collaborator:

[doubt] Should this be an S3 path?

Contributor Author:

No, I actually intend to do local only for now. I need to check with the team about which buckets we are allowed to write into from CI before I add other cloud-specific tests.

Collaborator:

Let's remove these then? Add them back when we add those tests.

Contributor Author (@gh-yzou), Apr 23, 2025:

Do you mean removing the location from the query? Our client side currently doesn't work for table creation without an explicit location specification, i.e., tables whose location is managed by Spark. Therefore, I have to explicitly specify the location in the query for now.
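
To illustrate, a small sketch using the table from this regtest file, assuming spark-sql is launched with the Polaris Spark client configured as in these regtests; the first statement mirrors the form that works today, and the commented-out variant is the form that is not yet supported:

```shell
# Works today: the table location is given explicitly in the query.
spark-sql -e "create table delta_tb1(col1 string) using delta location 'file:///tmp/spark_catalog/delta_tb1';"

# Not yet supported by the client: omitting the location and letting Spark
# manage it.
# spark-sql -e "create table delta_tb1(col1 string) using delta;"
```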

Collaborator:

No, I meant references to any cloud provider, if that is something planned for the future? It's fine to run locally, IMHO.

Contributor Author:

I did one pass to remove all cloud-specific configuration from the setup and Docker files. This test file doesn't have any cloud-specific parts, so I was confused.

show tables;

use db1;
create table delta_tb2(col1 int) using delta location 'file:///tmp/spark_catalog/delta_tb2';
Collaborator:

How about CTAS? Should we add that as well?

Contributor Author:

CTAS is unfortunately not supported yet; there is a limitation in how the Delta log works today with a third-party catalog, and we will need to follow up on support based on use cases. Also, I only intend to use the regtests for very basic tests; we will use integration tests for detailed testing.
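
For documentation purposes, a hypothetical sketch of the kind of statement that is not yet supported; the table name delta_tb3 is made up for illustration, and the unsupported part is CTAS through the client for Delta tables:

```shell
# Hypothetical CTAS statement; per the discussion above, this form is not yet
# supported by the Polaris Spark client for Delta tables.
# spark-sql -e "create table delta_tb3 using delta location 'file:///tmp/spark_catalog/delta_tb3' as select * from delta_tb1;"
```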

Collaborator:

Got it, let's just document it then; not sure what's the best place to document this.

> And I only intend to use the regtests for very basic tests

I understand. CTAS is honestly not that complicated, but I don't feel strongly about this, so we are good! I just request that you document it then.

Contributor Author:

Yes, CTAS itself is not complicated, but making CTAS work together with the Delta catalog and a third-party Spark catalog is not that straightforward. Yes, I will document all limitations in our README, which is updated in my other PR.

@gh-yzou gh-yzou requested a review from singhpk234 April 23, 2025 17:03
@singhpk234 (Collaborator) left a comment:

LGTM! Thanks @gh-yzou!

@flyrain flyrain merged commit 6e45ef7 into apache:main Apr 24, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Apr 24, 2025