Adding string row size iterator for row to column and column to row conversion #10157
Hmm, from my perspective this computation is inefficient. You are looping col-by-col within each row: for each row, you iteratively access all the columns before going to the next row, so each column ends up being accessed separately num_rows times.
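A minimal sketch of the access pattern being described; this is a hypothetical reconstruction (row_key, element_size, and the placeholder size are not the PR's actual code), with keys laid out row-major so each column gets touched once per row:

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>

// Keys are row indices; consecutive elements walk across the columns of one row.
struct row_key {
  int num_columns;
  __host__ __device__ int operator()(int i) const { return i / num_columns; }
};

// Stand-in for "size of element (row, col)"; the real code would read the
// column's type/offsets here.
struct element_size {
  int num_columns;
  __host__ __device__ int operator()(int i) const {
    int row = i / num_columns;
    int col = i % num_columns;
    return (row + col) % 7;  // placeholder size
  }
};

void row_sizes_row_major(int num_rows, int num_columns,
                         thrust::device_vector<int>& d_row_sizes) {
  auto idx   = thrust::make_counting_iterator(0);
  auto keys  = thrust::make_transform_iterator(idx, row_key{num_columns});
  auto sizes = thrust::make_transform_iterator(idx, element_size{num_columns});
  // Consecutive threads land in different columns of the same row, so every
  // column's data is read in a strided fashion, num_rows times per column.
  thrust::reduce_by_key(thrust::device, keys, keys + num_rows * num_columns,
                        sizes, thrust::make_discard_iterator(),
                        d_row_sizes.begin());
}
```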
How about this?
This way, you may not be able to use reduce_by_key. Instead, you need to initialize d_row_offsets to zero (thrust::uninitialized_fill) and then atomicAdd each output value.
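A rough sketch of that alternative, assuming hypothetical names (d_row_sizes, accumulate_row_size) and a placeholder per-element size: zero the output first, then walk the elements column-major and atomicAdd each contribution into its row's slot.

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/fill.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>

// Index space is column-major: all rows of column 0, then column 1, and so on.
struct accumulate_row_size {
  int* d_row_sizes;
  int num_rows;
  __device__ void operator()(int i) const {
    int col = i / num_rows;
    int row = i % num_rows;
    int sz  = (row + col) % 7;  // placeholder for the real per-element size
    atomicAdd(d_row_sizes + row, sz);
  }
};

void row_sizes_column_major(int num_rows, int num_columns,
                            thrust::device_vector<int>& d_row_sizes) {
  thrust::fill(d_row_sizes.begin(), d_row_sizes.end(), 0);  // zero the accumulators
  thrust::for_each(thrust::device, thrust::make_counting_iterator(0),
                   thrust::make_counting_iterator(num_rows * num_columns),
                   accumulate_row_size{
                       thrust::raw_pointer_cast(d_row_sizes.data()), num_rows});
}
```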
I'm not sure this solution is more efficient. It should be if we have a large number of columns; otherwise I don't know.
If you can add a benchmark to compare the solutions, that would be great 😄
Spark's max columns setting defaults to 100, and it seems far more likely to have a very large number of rows than columns. With the requirement that keys be consecutive, we can't simply flip the math. I will do some performance testing and report back.
I performance tested this code, and this function runs in about 1.2ms on my PC for 50 columns and 1,000,000 rows of intermixed int and string columns. With the changes to drop reduce_by_key and march the data in a more natural way, that time drops to 0.75ms. This seems worth it even though it removes the chance to use the cool transform output iterator suggested in review. Thanks for pushing for this; I probably dismissed it because I was excited to use reduce_by_key.
@revans2 is that default limit 100 or have I been led astray by my reading?
Yes it is 100.
https://github.com/apache/spark/blob/899d3bb44d7c72dc0179545189ac8170bde993a8/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L1356-L1362
But it actually calculates it with nesting levels too, so it is a bit more complicated.
https://github.com/apache/spark/blob/899d3bb44d7c72dc0179545189ac8170bde993a8/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L566-L576
Kudos, @ttnghia and @hyperbolic2346!
I'm having a hard time grokking why this iteration order is faster. All the string columns have to eventually be accessed num_rows times. So this should be a matter of... proximity? All threads in the warp acting on proximal locations in memory?
In the old way, we access row 0 of all columns 0, 1, 2, etc., then row 1 of all columns 0, 1, 2, etc., and so on. Each row access pulls data for different columns from different locations in memory.
In the new way, we access rows 0, 1, 2, etc. of column 0, then rows 0, 1, 2, etc. of column 1, and so on. So the data is pulled from contiguous memory locations.
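For a string column the per-row size comes from two adjacent offsets, so the difference is really which memory locations neighboring threads hit. Illustrative device helpers only, with assumed names:

```cpp
// One column walked row-by-row: neighboring threads read adjacent entries of
// the same offsets array, so the loads coalesce.
__device__ int size_walking_one_column(int const* d_offsets, int row) {
  return d_offsets[row + 1] - d_offsets[row];
}

// One row walked column-by-column: neighboring threads dereference different
// offsets arrays, so the loads scatter across memory.
__device__ int size_walking_across_columns(int const* const* d_col_offsets,
                                            int col, int row) {
  int const* offsets = d_col_offsets[col];
  return offsets[row + 1] - offsets[row];
}
```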