Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding string row size iterator for row to column and column to row conversion #10157

Merged

Conversation

hyperbolic2346
Copy link
Contributor

This is the first step to supporting variable-width strings in the row to column and column to row code. It adds an iterator that reads the offset columns inside string columns to compute the row sizes of this variable-width data.

Note that this doesn't add support for strings yet, but is the first step in that direction.

closes #10111

@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner January 28, 2022 07:14
@github-actions github-actions bot added the Java Affects Java cuDF API. label Jan 28, 2022
@hyperbolic2346 hyperbolic2346 added 3 - Ready for Review Ready for review by team non-breaking Non-breaking change feature request New feature or request labels Jan 28, 2022
@codecov

This comment was marked as off-topic.

@jjacobelli
Copy link
Contributor

rerun tests

Comment on lines 218 to 225
auto data_iter = cudf::detail::make_counting_transform_iterator(
0, [d_string_columns_offsets = d_string_columns_offsets.data(), num_columns,
num_rows] __device__(auto element_idx) {
auto const row = element_idx / num_columns;
auto const col = element_idx % num_columns;

return d_string_columns_offsets[col][row + 1] - d_string_columns_offsets[col][row];
});
Copy link
Contributor

@ttnghia ttnghia Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Humm, from my perspective this computation is inefficient. You are looping col-by-col. That means, for each row, you iteratively access all the cols before going to the next row. Each col will be accessed separately by num_rows times.

Copy link
Contributor

@ttnghia ttnghia Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this?

auto const row = element_idx % num_rows;
auto const col = element_idx / num_rows;
...

This way, you may not be able to use reduce_by_key. Instead, you need to initialize the d_row_offsets to zero (thrust::uninitialized_fill) then atomicAdd each output value.

Copy link
Contributor

@ttnghia ttnghia Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this solution is more efficient. It should if we have large number of columns. Otherwise I don't know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can have a benchmark to compare the solutions then it's great 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark max columns defaults to 100 and it seems far more likely to have a very large number of rows. With the requirement of keys being consecutive we can't simply flip the math. I will do some performance testing and report back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I performance tested this code and it seems this function runs in about 1.2ms on my PC for 50 columns and 1,000,000 rows of intermixed ints and string columns. With the changes to not use reduce_by_key and march the data in a more natural way this time drops to 0.75ms. This seems worth it even though it removes the chance of the cool transform output iterator suggested in review. Thanks for pushing for this. I dismissed it probably because I was excited to use reduce_by_key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@revans2 is that default limit 100 or have I been led astray by my reading?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

march the data in a more natural way this time drops to 0.75ms.

Kudos, @ttnghia and @hyperbolic2346!
I'm having a hard time grokking why this iteration order is faster. All the string columns have to eventually be accessed num_rows times. So this should be a matter of... proximity? All threads in the warp acting on proximal locations in memory?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old way: we access rows 0 of all columns 0, 1, 2, etc then we access rows 1 of all columns 0, 1, 2, etc and so on. Each row access will pull data from different columns from different locations in memory.
In the new way: we access rows 0, 1, 2, etc of column 0, then rows 0, 1, 2, etc of column 1 and so on. So the data is pulled from contiguous memory locations.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good, but my C++ is not great so I am not going to approve this.

java/src/main/native/src/row_conversion.cu Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
java/src/main/native/src/row_conversion.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm 👍 on the changes, after the other reviewers' comments are addressed.
Thank you for your patience, @hyperbolic2346.

@hyperbolic2346
Copy link
Contributor Author

rerun tests

@hyperbolic2346
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit dcac052 into rapidsai:branch-22.04 Feb 11, 2022
@hyperbolic2346 hyperbolic2346 deleted the mwilson/string-iterator branch February 11, 2022 05:25
rapids-bot bot pushed a commit that referenced this pull request Mar 22, 2022
This is the code for the column to row portion of the string work. This code will convert a table that includes strings into the JCUDF row format. This depends on #10157 and as such, is a draft PR until that is merged. I am putting this up now so people working on reviewing that PR can see where it is headed.

closes #10234

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - MithunR (https://github.com/mythrocks)
  - https://github.com/nvdbaranec

URL: #10235
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request Java Affects Java cuDF API. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add iterator for variable-length string data row offsets
7 participants