Add regtests for Spark client to test built jars #1402
Conversation
LGTM overall. Left questions and minor comments.
AWS_ACCESS_KEY_ID: ${{secrets.AWS_ACCESS_KEY_ID}}
AWS_SECRET_ACCESS_KEY: ${{secrets.AWS_SECRET_ACCESS_KEY}}
run: |
docker compose -f plugins/spark/v3.5/regtests/docker-compose.yml up --build --exit-code-from regtest
Nit: add an empty line at the end.
added
plugins/spark/v3.5/regtests/setup.sh (Outdated)
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
Q: this is optional, should we just remove it?
removed
spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
spark.sql.catalog.polaris.uri=http://${POLARIS_HOST:-localhost}:8181/api/catalog
spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials
Maybe add a comment here that this is needed for Iceberg tables only for now.
added
spark.sql.catalog.polaris.uri=http://${POLARIS_HOST:-localhost}:8181/api/catalog
spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials
spark.sql.catalog.polaris.client.region=us-west-2
spark.sql.sources.useV1SourceList=''
Q: do we need this config?
Yes, this is needed because the Polaris Spark client relies on DataSourceV2 to load Spark tables; this makes sure no V1 source is used when resolving the provider. Added a comment for this line as well.
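For illustration, a minimal sketch (assuming a spark-sql session launched with the regtests setup.sh configuration); the SET statement only inspects the session value, and an empty list means every provider, including delta, resolves through the DataSourceV2 path:

```sql
-- Inspect the current value inside a spark-sql session started via setup.sh;
-- an empty list means no provider (delta, parquet, ...) falls back to DataSourceV1.
SET spark.sql.sources.useV1SourceList;
```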
Supporting only DSv2 has a very big performance concern (please refer to this blog: https://databeans-blogs.medium.com/delta-vs-iceberg-vs-hudi-reassessing-performance-cb8157005eb0); this is mostly due to Delta being on v1 and Iceberg on v2, and some Delta features don't work. Do we document that?
Also, why would we change native Parquet to v2 (not just Delta's Parquet)?
I don't think this changes Parquet to v2; it just returns DeltaTableV2. For a Delta table, even if I return a V1Table, DeltaCatalog will reconstruct a DeltaTableV2 out of it. If I recall correctly, even with DeltaTableV2, some operations might still fall back to V1. Today we rely on DataSourceV2.loadTable to help construct the SparkTable; there is no such utility for DataSourceV1 yet. If we run into problems in the near future with other table types like CSV, we can add support for constructing V1Tables.
> I don't think this changes Parquet to v2; it just returns DeltaTableV2

It does; please refer to the default value of this config: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3882

> If I recall correctly, even with DeltaTableV2, some operations might still fall back to V1

That's true. What do we do in that case, fail? Otherwise it can cause correctness issues.

> there is no such utility for DataSourceV1 yet, if we run into problems in the near future for other table types like csv

Sure, we just need to make sure we document that DSv1 is not supported and let customers decide. Delta is trying to move to DSv2, but that is at least a year away.
Yes, I will add a limitations section to the Spark README to document the current known limitations.
plugins/spark/v3.5/regtests/setup.sh (Outdated)
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog
spark.sql.catalog.polaris.uri=http://${POLARIS_HOST:-localhost}:8181/api/catalog
# this configuration is used
It seems truncated.
Suggested change:
# this configuration is used only for Iceberg tables now. Generic tables don't support credential vending yet.
Sorry, I think I actually forgot to complete the comment; updated.
Before you run the test, make sure you build the project to generate the Spark client jars.
```shell
./gradlew build
```

In this setup, a Polaris server must be running on localhost:8181 before running tests. The simplest
way to do this is to run the Polaris server in a separate terminal window:

```shell
./gradlew run
```
Maybe we could simplify it like the following; we could still keep the "Note: xxx" part.
Suggested change (replace the quoted block above with):
- `./gradlew build` -- build the Spark Client jars.
- `./gradlew run` -- start a Polaris server on localhost:8181.
- `env POLARIS_HOST=localhost ./plugins/spark/v3.5/regtests/run.sh` -- run regtests
sg! updated
AWS_ACCESS_KEY_ID: ${{secrets.AWS_ACCESS_KEY_ID}}
AWS_SECRET_ACCESS_KEY: ${{secrets.AWS_SECRET_ACCESS_KEY}}
[doubt] Why only AWS, not GCP / Azure?
No particular reason; I was mainly following how polaris/regtests works today. My main purpose for the regtests is to make sure Spark can start with the packaged jars and perform basic operations right now, so this should be enough for the current purpose. We are likely going to add more tests to cover more usages later.
# under the License.
#

FROM docker.io/apache/spark:3.5.5-java17
We are running Java 21 in CI; why are we pulling Java 17 here then?
This mainly follows how the existing regtests work today: it pulls a published Spark Docker image from https://hub.docker.com/r/apache/spark/tags, and it seems only Java 11 and Java 17 images are available for Spark 3.5.5.
Let's then make the CI use 17?
Does it matter what the CI uses, since we are starting Spark in Docker? The CI also does more than just running the Spark Docker image: it starts a container with the server, which builds the Polaris project and starts the Polaris server, and the Polaris server uses Java 21. Furthermore, I would like to be consistent with how the current regtests CI looks today.
Generally, the JVM used in CI is a stamp of approval that things work with that JVM version, both at runtime and at compile time, and building against a JVM requires special flags per JVM version. Saying this from my past experience upgrading Iceberg to JDK 17: apache/iceberg#7391.
If we can only assert the Polaris Spark client with JDK 17, then we should have only JDK 17. If we are saying the polaris-spark client will work on Java 21 (both build and runtime), then I think it is fine to have JDK 21 (though I would like to understand more about how this is guaranteed, when Spark doesn't do that).
The previous tests were not building a Spark client of their own; they were just using it, asserting that Polaris works with 21, so 21 in CI made sense. At least that's my read of it.
This regression test starts two containers, one for the Polaris server and another for Spark. The Polaris server build and run are all with Java 21; the CI builds the server image, so it needs Java 21 since that is what the server supports. The Spark image is downloaded directly from Docker Hub, and Docker manages the Java version and dependencies separately, so it doesn't need to be consistent with the Java version the CI uses.
So our Polaris server container runs with Java 21 and our Spark container runs with Java 17, which is actually the closest setup to how the server/client environment and jar release look for a user.
The Iceberg CI setup is different; it is mostly integration tests, which build and run directly in the CI environment, not in Docker containers. Our integration tests all run with Java 21 in another CI today, so we have tests covering the Spark client working with both Java 17 and Java 21. We will need to add better CI support for different Java versions with the integration tests later.
Discussed offline, will add CI to run regression tests on different Java versions, at least 11 and 17
GOOGLE_APPLICATION_CREDENTIALS: $GOOGLE_APPLICATION_CREDENTIALS
AZURE_TENANT_ID: $AZURE_TENANT_ID
AZURE_CLIENT_ID: $AZURE_CLIENT_ID
AZURE_CLIENT_SECRET: $AZURE_CLIENT_SECRET
Are we using these?
I don't think we are using them today; this mostly follows how regtests works today. We will use them later when I start adding cloud-specific test cases.
Let's add them when we add those tests then?
I removed those credentials for now; we will add them back when I add cloud-specific tests. I also removed the credentials folder.
insert into iceberg_tb values (123), (234), (111);
select * from iceberg_tb order by col1;

create table delta_tb1(col1 string) using delta location 'file:///tmp/spark_catalog/delta_tb1';
[doubt] Should this be an S3 path?
No, I actually intend to just use local paths for now. I need to check with the team about which buckets we are allowed to write into in the CI before I add other cloud-specific tests.
Let's remove these then? Add them when we add those tests.
Do you mean remove the location in the query? Our client side currently doesn't work for table creation without an explicit location specification, where the table location would be managed by Spark. Therefore, I have to explicitly specify the location in the query for now.
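For context, a minimal sketch of the distinction (the second table name is hypothetical, only for illustration):

```sql
-- Works with the current client: the table location is supplied explicitly.
create table delta_tb1(col1 string) using delta location 'file:///tmp/spark_catalog/delta_tb1';

-- Not supported yet (hypothetical): without an explicit location, Spark would
-- have to manage the table location itself.
-- create table delta_tb1_managed(col1 string) using delta;
```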
No, I meant any reference to cloud provider code, if that is something planned for the future? It's fine to run locally, IMHO.
I did one pass to remove all cloud-specific configurations from the setup and Docker files. This test file doesn't have any cloud-specific parts, so I was confused.
show tables;

use db1;
create table delta_tb2(col1 int) using delta location 'file:///tmp/spark_catalog/delta_tb2';
How about CTAS? Should we add that as well?
CTAS is unfortunately not supported yet; there is a limitation in how the Delta log works today with a third-party catalog, and we will need to follow up on that support based on the use case. I only intend to use the regtests for very basic tests; we will use integration tests for detailed testing.
Got it, let's just document it then; not sure what's the best place to document this.

> And I only intend to use regtests for very basic tests

I understand. CTAS is not that complicated, honestly, but I don't feel strongly about this, so we are good! I just request that you document it then.
Yes, CTAS is not complicated, but making CTAS work together with DeltaCatalog and a third-party Spark catalog is not that straightforward. Yes, I will document all limitations in our README, which is updated in my other PR.
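For reference, a rough sketch of the CTAS shape that is currently out of scope (table and path names are illustrative):

```sql
-- Expected to fail with the current Polaris Spark client: CTAS against a Delta
-- table in a third-party catalog is not supported yet.
create table delta_ctas using delta location 'file:///tmp/spark_catalog/delta_ctas'
as select col1 from delta_tb2;
```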
LGTM! Thanks @gh-yzou!
This PR does the following: