
Add regtests for Spark client to test built jars #1402

Merged

merged 25 commits into apache:main on Apr 24, 2025

Conversation

@gh-yzou (Contributor) commented Apr 18, 2025:

This PR does the following:

  1. Fixes the jar job for the Spark client by excluding problematic jars.
  2. Adds regtests for the Spark client to test Spark usage with the built jars. They follow how polaris/regtests works today, supporting both Docker and local runs; the Docker run is also what CI uses (see the sketch below).
  3. Adds a new GitHub workflow to run the regtests for the Spark client.
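
For reference, a rough sketch of the two ways to run these regtests, pieced together from the commands referenced elsewhere in this PR (treat it as an outline rather than the authoritative instructions):

```shell
# Local run (sketch): build the Spark client jars, start a Polaris server on
# localhost:8181 in a separate terminal, then run the regtest script.
./gradlew build
./gradlew run    # run this in a separate terminal; it serves Polaris on localhost:8181
env POLARIS_HOST=localhost ./plugins/spark/v3.5/regtests/run.sh

# Docker run (sketch): this is also what the new CI workflow does.
docker compose -f plugins/spark/v3.5/regtests/docker-compose.yml up --build --exit-code-from regtest
```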

flyrain previously approved these changes Apr 22, 2025

@flyrain (Contributor) left a comment:

LGTM overall. Left questions and minor comments.

AWS_ACCESS_KEY_ID: ${{secrets.AWS_ACCESS_KEY_ID}}
AWS_SECRET_ACCESS_KEY: ${{secrets.AWS_SECRET_ACCESS_KEY}}
run: |
docker compose -f plugins/spark/v3.5/regtests/docker-compose.yml up --build --exit-code-from regtest
Contributor:

Nit: add an empty line at the end.

Contributor Author:

added

spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
Contributor:

Q: this is optional, should we just remove it?

Contributor Author:

removed

spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
spark.sql.catalog.polaris.uri=http://${POLARIS_HOST:-localhost}:8181/api/catalog
spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials
Contributor:

Maybe add a comment here noting that this is needed only for Iceberg tables right now.

Contributor Author:

added

spark.sql.catalog.polaris.uri=http://${POLARIS_HOST:-localhost}:8181/api/catalog
spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials
spark.sql.catalog.polaris.client.region=us-west-2
spark.sql.sources.useV1SourceList=''
Contributor:

Q: do we need this config?

Contributor Author:

Yes, this is needed because the Polaris Spark client relies on DataSourceV2 to load Spark tables; it makes sure no V1 source is used during provider resolution. Added a comment for this line as well.
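
As an illustration only, here is a minimal sketch of how these settings could be passed on the spark-sql command line. It assumes the packaged Polaris Spark client jars are already on the Spark classpath and a Polaris server is reachable on localhost:8181; the catalog name and URI simply follow the spark-defaults snippet above:

```shell
# Sketch: start spark-sql with the Polaris catalog configured and an empty V1
# source list, so providers are resolved through DataSourceV2 only.
spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog \
  --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
  --conf spark.sql.sources.useV1SourceList=""
```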

Collaborator:

Supporting only DSv2 has a very big performance concern (please see this blog: https://databeans-blogs.medium.com/delta-vs-iceberg-vs-hudi-reassessing-performance-cb8157005eb0); this is mostly due to Delta being on v1 and Iceberg on v2, and some Delta features don't work. Do we document that?

Also, why would we change native Parquet to v2 (not just Delta Parquet)?

Contributor Author:

I don't think this changes Parquet to v2; it just returns DeltaTableV2. For a Delta table, even if I return a V1Table, the Delta catalog will reconstruct a DeltaTableV2 out of it, and if I recall correctly, even with DeltaTableV2 some operations might still fall back to V1. Today we rely on DataSourceV2.loadTable to help construct the SparkTable; there is no such utility for DataSourceV1 yet. If we run into problems in the near future with other table types like CSV, we can add support for constructing V1Tables.

Collaborator:

> I don't think this changes Parquet to v2; it just returns DeltaTableV2.

It does; please see the default value of this config: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3882

> If I recall correctly, even with DeltaTableV2, some operations might still fall back to V1.

That's true. What do we do in that case, fail? Otherwise it can cause correctness issues.

> There is no such utility for DataSourceV1 yet; if we run into problems in the near future with other table types like CSV...

Sure, we just need to make sure we document that DSv1 is not supported and then let customers decide. Delta is trying to move to DSv2, but that's at least a year out.

Contributor Author:

Yes, I will add an extra limitations section in the Spark README to document the currently known limitations.

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Apr 22, 2025
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog
spark.sql.catalog.polaris.uri=http://${POLARIS_HOST:-localhost}:8181/api/catalog
# this configuration is used
Contributor:

It seems truncated.

Suggested change:
- # this configuration is used
+ # this configuration is used only for Iceberg tables now. Generic tables don't support credential vending yet.

Contributor Author:

Sorry, I think I actually forgot to complete the comment; updated.

flyrain previously approved these changes Apr 23, 2025
Comment on lines 78 to 88
Before you run the test, make sure you build the project to generate the Spark client jars.
```shell
./gradlew build
```

In this setup, a Polaris server must be running on localhost:8181 before running tests. The simplest
way to do this is to run the Polaris server in a separate terminal window:

```shell
./gradlew run
```
Contributor:

Maybe we could simplify it like the following; we could still keep the "Note: xxx" part.

Suggested change
Before you run the test, make sure you build the project to generate the Spark client jars.
```shell
./gradlew build
```
In this setup, a Polaris server must be running on localhost:8181 before running tests. The simplest
way to do this is to run the Polaris server in a separate terminal window:
```shell
./gradlew run
```
- `./gradlew build` -- build the Spark Client jars.
- `./gradlew run` -- start a Polaris server on localhost:8181.
- `env POLARIS_HOST=localhost ./plugins/spark/v3.5/regtests/run.sh` -- run regtests

Contributor Author:

Sounds good! Updated.

flyrain previously approved these changes Apr 23, 2025
Comment on lines 58 to 59
AWS_ACCESS_KEY_ID: ${{secrets.AWS_ACCESS_KEY_ID}}
AWS_SECRET_ACCESS_KEY: ${{secrets.AWS_SECRET_ACCESS_KEY}}
Collaborator:

[doubt] Why only AWS, not GCP / Azure?

Contributor Author:

No particular reason; I was mainly following how polaris/regtests works today. My main purpose for the regtests right now is to make sure Spark can start with the packaged jars and perform basic operations, so this should be enough for the current purpose. We will likely add more tests to cover more usage later.

# under the License.
#

FROM docker.io/apache/spark:3.5.5-java17
Collaborator:

We are running Java 21 in CI; why are we pulling Java 17 here then?

Contributor Author:

This mainly follows how the existing regtests work today: it pulls an existing Spark image from the published Spark Docker images here: https://hub.docker.com/r/apache/spark/tags. It seems only Java 11 and Java 17 images are available for Spark 3.5.5.

Collaborator:

Let's then make the CI use Java 17?

Contributor Author:

Does it matter what the CI uses, since we are starting everything in Docker? And the CI does more than just run the Spark container; it also starts the container with the server, which builds the Polaris project and starts the Polaris server, and the Polaris server will use Java 21. Furthermore, I would like to be consistent with how the current regtests CI looks today.

Collaborator:

Generally, the JVM used in CI is a stamp of approval that this works for that JVM version, both runtime- and compile-wise; building for a given JVM can require special flags per JVM version.
Saying this from my past experience upgrading Iceberg to JDK 17: apache/iceberg#7391

If we can only assert the Polaris Spark client with JDK 17, then we should have only JDK 17.

If we are saying the polaris-spark client will work on Java 21 (both build and runtime), then I think it is fine to have JDK 21 (though I would like to understand more about how this is guaranteed, when Spark doesn't do that?).

The previous tests were not building a Spark client of their own; they were using it, just asserting that Polaris is OK to work with 21, so 21 in CI made sense.

At least that's my read of it.

Contributor Author (@gh-yzou), Apr 23, 2025:

This regression test starts two containers: one for the Polaris server, and another one for Spark. The Polaris server build and run are both on Java 21; the CI builds the server image, so it needs Java 21 since that is what the server supports. The Spark image is downloaded directly from Docker Hub, and Docker manages the Java version and dependencies separately, so it doesn't need to match the Java version used by the CI.
So we have our Polaris server container running with Java 21 and our Spark container running with Java 17, which is actually the closest setup to how the server/client environment and jar release look for a user.

The Iceberg CI setup is different: it is mostly integration tests, which build and run directly in the CI environment, not in Docker containers. Our integration tests all run with Java 21 in another CI today, so we already have tests covering our Spark client working with both Java 17 and Java 21. We will need to add better CI support for running the integration tests on different Java versions later.

Contributor Author:

Discussed offline, will add CI to run regression tests on different Java versions, at least 11 and 17

Comment on lines 30 to 33
GOOGLE_APPLICATION_CREDENTIALS: $GOOGLE_APPLICATION_CREDENTIALS
AZURE_TENANT_ID: $AZURE_TENANT_ID
AZURE_CLIENT_ID: $AZURE_CLIENT_ID
AZURE_CLIENT_SECRET: $AZURE_CLIENT_SECRET
Collaborator:

Are we using these?

Contributor Author (@gh-yzou), Apr 23, 2025:

I don't think we are using them today; this mostly follows how regtests works today. We will use them later when I start adding cloud-specific test cases.

Collaborator (@singhpk234), Apr 23, 2025:

Let's add these when we add those tests then?

Contributor Author (@gh-yzou), Apr 23, 2025:

I removed those credentials for now; we will add them back when I add cloud-specific tests. I also removed the credentials folder.

insert into iceberg_tb values (123), (234), (111);
select * from iceberg_tb order by col1;

create table delta_tb1(col1 string) using delta location 'file:///tmp/spark_catalog/delta_tb1';
Collaborator:

[doubt] Should this be an S3 path?

Contributor Author:

No, I actually intend to do local only for now. I need to check with the team about which buckets we are allowed to write into from CI before I add other cloud-specific tests.

Collaborator:

Let's remove these then? Add them back when we add those tests.

Contributor Author (@gh-yzou), Apr 23, 2025:

Do you mean removing the location from the query? Our client side currently doesn't work for table creation without an explicit location specification, i.e., tables whose location is managed by Spark. Therefore, I have to explicitly specify the location in the query for now.
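
To illustrate, a small sketch using the table from this regtest file, assuming spark-sql is launched with the Polaris Spark client configured as in these regtests; the first statement mirrors the form that works today, and the commented-out variant is the form that is not yet supported:

```shell
# Works today: the table location is given explicitly in the query.
spark-sql -e "create table delta_tb1(col1 string) using delta location 'file:///tmp/spark_catalog/delta_tb1';"

# Not yet supported by the client: omitting the location and letting Spark
# manage it.
# spark-sql -e "create table delta_tb1(col1 string) using delta;"
```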

Collaborator:

No, I meant references to any cloud provider, if that is something planned for the future? It's fine to run locally, IMHO.

Contributor Author:

I did one pass to remove all cloud-specific configuration from the setup and Docker files. This test file doesn't have any cloud-specific parts, so I was confused.

show tables;

use db1;
create table delta_tb2(col1 int) using delta location 'file:///tmp/spark_catalog/delta_tb2';
Collaborator:

How about CTAS? Should we add that as well?

Contributor Author:

CTAS is unfortunately not supported yet; there is a limitation in how the Delta log works today with a third-party catalog, and we will need to follow up on support based on use cases. Also, I only intend to use the regtests for very basic tests; we will use integration tests for detailed testing.
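
For documentation purposes, a hypothetical sketch of the kind of statement that is not yet supported; the table name delta_tb3 is made up for illustration, and the unsupported part is CTAS through the client for Delta tables:

```shell
# Hypothetical CTAS statement; per the discussion above, this form is not yet
# supported by the Polaris Spark client for Delta tables.
# spark-sql -e "create table delta_tb3 using delta location 'file:///tmp/spark_catalog/delta_tb3' as select * from delta_tb1;"
```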

Collaborator:

Got it, let's just document it then; not sure what's the best place to document this.

> And I only intend to use the regtests for very basic tests

I understand. CTAS is honestly not that complicated, but I don't feel strongly about this, so we are good! I just request that you document it then.

Contributor Author:

Yes, CTAS itself is not complicated, but making CTAS work together with the Delta catalog and a third-party Spark catalog is not that straightforward. Yes, I will document all limitations in our README, which is updated in my other PR.

@gh-yzou gh-yzou requested a review from singhpk234 April 23, 2025 17:03
@singhpk234 (Collaborator) left a comment:

LGTM! Thanks @gh-yzou!

@flyrain flyrain merged commit 6e45ef7 into apache:main Apr 24, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Apr 24, 2025