Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async write support for ORC #11865

Merged
merged 5 commits into from
Dec 24, 2024
Merged

Conversation

jihoonson
Copy link
Collaborator

Follow up to #11730.

This PR adds the async write support for the ORC format as well as an integration test for it. This PR is marked as a draft since it depends on #11855.

@jihoonson jihoonson changed the title Async orc write Async write support for ORC Dec 11, 2024
@jihoonson
Copy link
Collaborator Author

build

@jihoonson jihoonson marked this pull request as ready for review December 14, 2024 01:18
@jihoonson
Copy link
Collaborator Author

build

@jihoonson
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator

abellina commented Dec 18, 2024

Any indication that this is a performance improvement for Orc? I assume so, but asking if you have data.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but want to hear about the coalesce question from @abellina

Copy link
Collaborator

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, barring the accidental inclusion of operatorsScore.csv and supportedExprs.csv.

@jihoonson
Copy link
Collaborator Author

jihoonson commented Dec 19, 2024

Any indication that this is a performance improvement for Orc? I assume so, but asking if you have data.

@abellina I haven't run any performance test yet, but agree it will be useful. Unfortunately, I'm quite stuck in other work right now. Do you mind if I run some benchmark later and post the results here?

@jihoonson
Copy link
Collaborator Author

build

gen_list = [('_c' + str(i), gen) for i, gen in enumerate(orc_gen)]
assert_gpu_and_cpu_writes_are_equal_collect(
lambda spark, path: gen_df(spark, gen_list, length=num_rows).write.orc(path),
lambda spark, path: spark.read.orc(path).orderBy([('_c' + str(i)) for i in range(num_cols)]),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the GPU and CPU orc readers can return the same rows in different orders. Added an orderBy to make the test deterministic.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we have @ignore_order marker for this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good to know. I will use this instead next time.

@jihoonson
Copy link
Collaborator Author

build

@jihoonson
Copy link
Collaborator Author

The test failure is filed in #11897.

@jihoonson
Copy link
Collaborator Author

build

@jihoonson
Copy link
Collaborator Author

Thanks all for the review!

@jihoonson jihoonson merged commit 0f702cd into NVIDIA:branch-25.02 Dec 24, 2024
49 of 50 checks passed
@sameerz sameerz added the performance A performance related task/issue label Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants