Adding string row size iterator for row to column and column to row conversion #10157
Conversation
rerun tests
auto data_iter = cudf::detail::make_counting_transform_iterator(
  0, [d_string_columns_offsets = d_string_columns_offsets.data(), num_columns,
      num_rows] __device__(auto element_idx) {
    auto const row = element_idx / num_columns;
    auto const col = element_idx % num_columns;

    return d_string_columns_offsets[col][row + 1] - d_string_columns_offsets[col][row];
  });
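For context on the exchange below: a minimal sketch, under assumptions, of how an iterator like `data_iter` can feed `thrust::reduce_by_key` to sum the per-(row, column) string sizes into one total per row. The per-row output buffer is referred to as `d_row_offsets` later in this thread; the `stream` and exact plumbing here are illustrative, not the PR's exact code. The key point is that `reduce_by_key` needs equal keys to be adjacent, which holds when the row key is `element_idx / num_columns`.

```cpp
#include <thrust/iterator/discard_iterator.h>
#include <thrust/reduce.h>
#include <rmm/exec_policy.hpp>

// Row key for each (row, column) element. Because element_idx / num_columns is
// non-decreasing over the counting sequence, equal keys are adjacent, as
// reduce_by_key requires.
auto row_keys = cudf::detail::make_counting_transform_iterator(
  0, [num_columns] __device__(auto element_idx) { return element_idx / num_columns; });

// Sum the per-(row, column) sizes produced by data_iter into one size per row.
thrust::reduce_by_key(rmm::exec_policy(stream), row_keys,
                      row_keys + num_rows * num_columns, data_iter,
                      thrust::make_discard_iterator(), d_row_offsets.begin());
```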
Hmm, from my perspective this computation is inefficient. You are looping col-by-col within each row: for each row, you iteratively access all the columns before going to the next row, so each column ends up being accessed separately `num_rows` times.
How about this?

auto const row = element_idx % num_rows;
auto const col = element_idx / num_rows;
...

This way, you may not be able to use `reduce_by_key`. Instead, you need to initialize `d_row_offsets` to zero (`thrust::uninitialized_fill`) and then `atomicAdd` each output value.
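A minimal sketch of this suggestion, assuming `d_row_offsets` is a device vector of `cudf::size_type` with one entry per row and `stream` is the CUDA stream in use (illustrative, not the PR's exact code): zero the output first, then let each (row, column) element atomically add its string size while walking each column's offsets contiguously.

```cpp
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/uninitialized_fill.h>
#include <rmm/exec_policy.hpp>

// Zero the per-row totals, since we accumulate with atomics instead of reduce_by_key.
thrust::uninitialized_fill(rmm::exec_policy(stream), d_row_offsets.begin(),
                           d_row_offsets.end(), 0);

thrust::for_each_n(
  rmm::exec_policy(stream), thrust::make_counting_iterator(0), num_rows * num_columns,
  [d_string_columns_offsets = d_string_columns_offsets.data(),
   d_row_offsets = d_row_offsets.data(), num_rows] __device__(auto element_idx) {
    auto const row = element_idx % num_rows;  // row index varies fastest
    auto const col = element_idx / num_rows;  // each column's offsets read contiguously
    auto const size =
      d_string_columns_offsets[col][row + 1] - d_string_columns_offsets[col][row];
    atomicAdd(&d_row_offsets[row], size);
  });
```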
I'm not sure if this solution is more efficient. It should be if we have a large number of columns; otherwise I don't know.
If you can put together a benchmark to compare the solutions, that would be great 😄
Spark's max columns setting defaults to 100, and it seems far more likely to have a very large number of rows. With the requirement that keys be consecutive, we can't simply flip the math. I will do some performance testing and report back.
I performance tested this code and it seems this function runs in about 1.2 ms on my PC for 50 columns and 1,000,000 rows of intermixed int and string columns. With the changes to not use `reduce_by_key` and to march the data in a more natural way, this time drops to 0.75 ms. This seems worth it even though it removes the chance to use the cool transform output iterator suggested in review. Thanks for pushing for this; I probably dismissed it because I was excited to use `reduce_by_key`.
@revans2 is that default limit 100 or have I been led astray by my reading?
Yes, it is 100. But it actually factors nesting levels into the calculation too, so it is a bit more complicated.
> march the data in a more natural way this time drops to 0.75ms.

Kudos, @ttnghia and @hyperbolic2346!

I'm having a hard time grokking why this iteration order is faster. All the string columns have to eventually be accessed `num_rows` times. So this should be a matter of... proximity? All threads in the warp acting on proximal locations in memory?
In the old way, we access row 0 of all columns 0, 1, 2, etc., then row 1 of all columns 0, 1, 2, etc., and so on. Each row access pulls data from different columns at different locations in memory.

In the new way, we access rows 0, 1, 2, etc. of column 0, then rows 0, 1, 2, etc. of column 1, and so on. So the data is pulled from contiguous memory locations.
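To make the contrast concrete, here is the index math of the two schemes side by side (illustrative only; the variable names mirror the snippets above):

```cpp
// Old mapping: the column index varies fastest, so adjacent threads in a warp
// read offsets from different columns, i.e. scattered memory locations.
auto const row_old = element_idx / num_columns;
auto const col_old = element_idx % num_columns;

// New mapping: the row index varies fastest, so adjacent threads read adjacent
// entries of the same column's offset buffer, i.e. coalesced loads.
auto const row_new = element_idx % num_rows;
auto const col_new = element_idx / num_rows;
```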
Co-authored-by: Nghia Truong <[email protected]>
It looks good, but my C++ is not great so I am not going to approve this.
I'm 👍 on the changes, after the other reviewers' comments are addressed.
Thank you for your patience, @hyperbolic2346.
rerun tests
Co-authored-by: Nghia Truong <[email protected]>
…udf into mwilson/string-iterator
@gpucibot merge
This is the code for the column to row portion of the string work. This code will convert a table that includes strings into the JCUDF row format. This depends on #10157 and as such is a draft PR until that is merged. I am putting this up now so people working on reviewing that PR can see where it is headed.

closes #10234

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- MithunR (https://github.com/mythrocks)
- https://github.com/nvdbaranec

URL: #10235
This is the first step to supporting variable-width strings in the row to column and column to row code. It adds an iterator that reads the offset columns inside string columns to compute the row sizes of this variable-width data.
Note that this doesn't add support for strings yet, but is the first step in that direction.
closes #10111
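As a rough illustration of that idea (a sketch under assumptions, not the merged implementation; `str_col` is a hypothetical `cudf::column_device_view` of a single strings column):

```cpp
#include <cudf/column/column_device_view.cuh>
#include <cudf/detail/iterator.cuh>
#include <cudf/types.hpp>

// For a strings column, child(0) is the offsets column, and row i's string
// occupies offsets[i + 1] - offsets[i] bytes of character data, so an iterator
// over per-row string sizes only needs to read adjacent offsets.
auto const offsets = str_col.child(0).data<cudf::size_type>();
auto string_size_iter = cudf::detail::make_counting_transform_iterator(
  0, [offsets] __device__(cudf::size_type row) {
    return offsets[row + 1] - offsets[row];
  });
```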