
Prevent grid stride loop overflow in libcudf kernels #10368

Open
nvdbaranec opened this issue Feb 28, 2022 · 15 comments
Labels
1 - On Deck (To be worked on next), bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

nvdbaranec (Contributor) commented Feb 28, 2022

(updated Aug 2023), (updated Nov 2024)

Background

We found a kernel indexing overflow issue, first discovered in the fused_concatenate kernels (#10333). The same issue is present in a number of our CUDA kernels that take the following form:

size_type output_index = threadIdx.x + blockIdx.x * blockDim.x;
while (output_index < output_size) {
  // ... process the element at output_index ...
  output_index += blockDim.x * gridDim.x;
}

If we have an output_size of, say, 1.2 billion and a grid sized to match, the following happens: some late thread index, say 1.19 billion, attempts to add 1.2 billion (blockDim.x * gridDim.x) and overflows size_type (signed 32-bit, maximum roughly 2.15 billion).
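One straightforward way to avoid the wraparound is to do the index arithmetic in 64 bits and only narrow back to size_type when touching column data. A minimal sketch of the same loop, assuming a 64-bit index alias such as cudf::thread_index_type (the alias adopted in the later PRs referenced below):

cudf::thread_index_type output_index =
  threadIdx.x + blockIdx.x * static_cast<cudf::thread_index_type>(blockDim.x);
auto const stride = static_cast<cudf::thread_index_type>(blockDim.x) * gridDim.x;
while (output_index < output_size) {
  // ... process the element at static_cast<size_type>(output_index) ...
  output_index += stride;  // 64-bit add, cannot wrap for any realistic grid
}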

We made a round of fixes in #10448, and then later found another instance of this error in #13838. Our first pass of investigation was not adequate to contain the issue, so we need to take another close look.

Part 1 - First pass: fix kernels with this issue

| Source file | Kernels | Status |
| --- | --- | --- |
| copying/concatenate.cu | fused_concatenate_kernel | #10448 |
| valid_if.cuh | valid_if_kernel | #10448 |
| scatter.cu | marking_bitmask_kernel | #10448 |
| replace/nulls.cu | replace_nulls_strings | #10448 |
| replace/nulls.cu | replace_nulls | #10448 |
| rolling/rolling_detail.cuh | gpu_rolling | #10448 |
| rolling/jit/kernel.cu | gpu_rolling_new | #10448 |
| transform/compute_column.cu | compute_column_kernel | #10448 |
| copying/concatenate.cu | fused_concatenate_string_offset_kernel | #13838 |
| replace/replace.cu | replace_strings_first_pass<br>replace_strings_second_pass<br>replace_kernel | #13905 |
| copying/concatenate.cu | concatenate_masks_kernel<br>fused_concatenate_string_offset_kernel<br>fused_concatenate_string_chars_kernel<br>fused_concatenate_kernel (int64) | #13906 |
| hash/helper_functions.cuh | init_hashtbl | #13895 |
| null_mask.cu | set_null_mask_kernel<br>copy_offset_bitmask<br>count_set_bits_kernel | #13895 |
| transform/row_bit_count.cu | compute_row_sizes | #13895 |
| multibyte_split.cu | multibyte_split_init_kernel<br>multibyte_split_seed_kernel (auto??)<br>multibyte_split_kernel | #13910 |
| IO modules: parquet, orc, json | | #13910 |
| io/utilities/parsing_utils.cu | count_and_set_positions (uint64_t) | #13910 |
| conditional_join_kernels.cuh | compute_conditional_join_output_size<br>conditional_join | #13971 |
| merge.cu | materialize_merged_bitmask_kernel | #13972 |
| partitioning.cu | compute_row_partition_numbers<br>compute_row_output_locations<br>copy_block_partitions | #13973 |
| json_path.cu | get_json_object_kernel | #13962 |
| tdigest | compute_percentiles_kernel (int) | #13962 |
| strings/attributes.cu | count_characters_parallel_fn | #13968 |
| strings/convert/convert_urls.cu | url_decode_char_counter (int)<br>url_decode_char_replacer (int) | #13968 |
| text/subword/data_normalizer.cu | kernel_data_normalizer (uint32_t) | #13915 |
| text/subword/subword_tokenize.cu | kernel_compute_tensor_metadata (uint32_t) | #13915 |
| text/subword/wordpiece_tokenizer.cu | init_data_and_mark_word_start_and_ends (uint32_t)<br>mark_string_start_and_ends (uint32_t)<br>kernel_wordpiece_tokenizer (uint32_t) | #13915 |

Part 2 - Take another pass over more challenging kernels

| Source file | Kernels | Status |
| --- | --- | --- |
| null_mask.cuh | subtract_set_bits_range_boundaries_kernel | |
| valid_if.cuh | valid_if_n_kernel | |
| copy_if_else.cuh | copy_if_else_kernel | |
| gather.cuh | gather_chars_fn_string_parallel | |
| more? (search gridDim.x or blockDim.x to find more examples) | | |

Part 2b - results of searching

| Source file: line content | Overflow risk (yes/no) | Status |
| --- | --- | --- |
| cpp/src/binaryop/compiled/binary_ops.cuh | yes | #17354 |
| cpp/src/binaryop/jit/kernel.cu | yes | #17420 |
| include/cudf/detail/copy_if.cuh: int tid = threadIdx.x + per_thread * block_size * blockIdx.x;<br>include/cudf/detail/copy_if.cuh: int tid = threadIdx.x + per_thread * block_size * blockIdx.x;<br>include/cudf/detail/copy_if.cuh: int const wid = threadIdx.x / cudf::detail::warp_size;<br>include/cudf/detail/copy_if.cuh: int const lane = threadIdx.x % cudf::detail::warp_size; | | #17520 |
| include/cudf/detail/copy_if_else.cuh: int const lane_id = threadIdx.x % cudf::detail::warp_size; | no | #17473 |
| include/cudf/detail/copy_range.cuh: int const lane_id = threadIdx.x % warp_size;<br>include/cudf/detail/copy_range.cuh: cudf::size_type const tid = threadIdx.x + blockIdx.x * blockDim.x;<br>include/cudf/detail/copy_range.cuh: source_idx += blockDim.x * gridDim.x; | yes | #17409 |
| include/cudf/detail/null_mask.cuh: auto const tid = threadIdx.x + blockIdx.x * blockDim.x;<br>include/cudf/detail/null_mask.cuh: destination_word_index += blockDim.x * gridDim.x) {<br>include/cudf/detail/null_mask.cuh: size_type const tid = threadIdx.x + blockIdx.x * blockDim.x;<br>include/cudf/detail/null_mask.cuh: range_id += blockDim.x * gridDim.x; | no | |
| include/cudf/detail/valid_if.cuh: auto block_offset = blockIdx.x * blockDim.x;<br>include/cudf/detail/valid_if.cuh: auto const thread_idx = block_offset + threadIdx.x;<br>include/cudf/detail/valid_if.cuh: block_offset += blockDim.x * gridDim.x; | no | |
| include/cudf/strings/detail/gather.cuh: int global_thread_id = blockIdx.x * blockDim.x + threadIdx.x;<br>include/cudf/strings/detail/gather.cuh: int nwarps = gridDim.x * blockDim.x / cudf::detail::warp_size;<br>include/cudf/strings/detail/gather.cuh: size_type begin_out_string_idx = blockIdx.x * strings_per_threadblock; | | #17404 |
| src/join/mixed_join_kernel.cuh: cudf::size_type outer_row_index = threadIdx.x + blockIdx.x * block_size;<br>src/join/mixed_join_size_kernel.cuh: cudf::size_type const start_idx = threadIdx.x + blockIdx.x * block_size; | | #17633 |
| src/partitioning/partitioning.cu: size_type partition_number = threadIdx.x;<br>src/partitioning/partitioning.cu: size_type const write_location = partition_number * gridDim.x + blockIdx.x;<br>src/partitioning/partitioning.cu: block_partition_offsets[partition_number * gridDim.x + blockIdx.x];<br>src/partitioning/partitioning.cu: if (ELEMENTS_PER_THREAD * threadIdx.x + i < num_partitions) {<br>src/partitioning/partitioning.cu: block_partition_sizes[blockIdx.x + (ELEMENTS_PER_THREAD * threadIdx.x + i) * gridDim.x];<br>src/partitioning/partitioning.cu: if (ELEMENTS_PER_THREAD * threadIdx.x + i < num_partitions) {<br>src/partitioning/partitioning.cu: scanned_block_partition_sizes[ipartition * gridDim.x + blockIdx.x]; | yes | #17473 |
| src/strings/regex/utilities.cuh: auto const thread_idx = threadIdx.x + blockIdx.x * blockDim.x;<br>src/strings/regex/utilities.cuh: auto const thread_idx = threadIdx.x + blockIdx.x * blockDim.x; | | #17404 |
| src/strings/search/find.cu: size_type const idx = static_cast<size_type>(threadIdx.x + blockIdx.x * blockDim.x);<br>src/strings/search/find.cu: size_type const idx = static_cast<size_type>(threadIdx.x + blockIdx.x * blockDim.x); | | #17404 |
| src/text/subword/data_normalizer.cu: uint32_t* block_base = code_points + blockIdx.x * blockDim.x * MAX_NEW_CHARS; | | #17404 |
| src/text/vocabulary_tokenize.cu: auto const idx = static_cast<std::size_t>(threadIdx.x + blockIdx.x * blockDim.x); | | #17404 |
| src/transform/compute_column.cu: &intermediate_storage[threadIdx.x * device_expression_data.num_intermediates]; | no | |
| src/transform/jit/kernel.cu: thread_index_type const start = threadIdx.x + blockIdx.x * blockDim.x;<br>src/transform/jit/kernel.cu: thread_index_type const stride = blockDim.x * gridDim.x; | | #17473 |
| src/transform/row_bit_count.cu: int const tid = threadIdx.x + blockIdx.x * blockDim.x;<br>src/transform/row_bit_count.cu: row_span* my_branch_stack = thread_branch_stacks + (threadIdx.x * max_branch_depth); | | #17473 |

Additional information

There are also a number of kernels that have this pattern but probably don't ever overflow because they are indexing by bitmask words. (Example)
Additionally, in this kernel, source_idx probably overflows, but harmlessly.

A snippet of code to see this in action:

size_type const size = 1200000000;
auto big = cudf::make_fixed_width_column(data_type{type_id::INT32}, size, mask_state::UNALLOCATED);  
auto x = cudf::rolling_window(*big, 1, 1, 1, cudf::detail::sum_aggregation{}); 

Note: rmm may mask out-of-bounds accesses in some cases (a pooled allocator can absorb writes past the end of an allocation), so it's helpful to run with the plain CUDA allocator.
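A hedged sketch of one way to do that, assuming current rmm APIs (placement at the start of the test program; the resource must outlive any allocations made through it):

#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

// At the start of the repro: install the plain, non-pooling CUDA memory
// resource so out-of-bounds accesses are more likely to fault instead of
// landing harmlessly inside a pooled allocation.
rmm::mr::cuda_memory_resource cuda_mr;
rmm::mr::set_current_device_resource(&cuda_mr);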

nvdbaranec added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Feb 28, 2022
nvdbaranec (Contributor, Author):
The fix in basically all of these cases is quite simple: just make the index a size_t

jrhemstad (Contributor):
I'd love to just add an algorithm to do this.

https://godbolt.org/z/hK95z7zff
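For illustration only (this is not the code behind the godbolt link), a hedged sketch of what such an algorithm could look like: a for_each-style wrapper that owns the grid-stride loop in 64-bit arithmetic, so callers only supply a per-element functor:

#include <algorithm>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper, names are illustrative. The kernel owns the
// grid-stride loop and does all index math in 64 bits.
template <typename F>
__global__ void for_each_n_kernel(std::int64_t n, F f)
{
  auto idx          = static_cast<std::int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  auto const stride = static_cast<std::int64_t>(blockDim.x) * gridDim.x;
  for (; idx < n; idx += stride) { f(idx); }
}

template <typename F>
void for_each_n(std::int64_t n, F f, cudaStream_t stream = 0)
{
  if (n <= 0) { return; }
  int const block_size  = 256;
  auto const num_blocks =
    static_cast<int>(std::min<std::int64_t>((n + block_size - 1) / block_size, 65535));
  for_each_n_kernel<<<num_blocks, block_size, 0, stream>>>(n, f);
}

// Usage (requires nvcc --extended-lambda for the device lambda):
//   for_each_n(out_size, [=] __device__ (std::int64_t i) { out[i] = in[i]; });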

harrism (Member) commented Mar 15, 2022

Or even a simple range helper for for loops: https://github.com/harrism/hemi#simple-grid-stride-loops
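In the same spirit, a hypothetical sketch of such a range helper (illustrative only, not the hemi or ranger implementation) that lets a kernel use a range-based for loop with 64-bit indexing handled internally:

// Hypothetical grid_stride_range helper, sketched for illustration: yields
// indices from the global thread id up to `size`, stepping by the total
// grid size, with all arithmetic done in 64 bits.
struct grid_stride_range {
  struct iterator {
    long long idx;
    long long step;
    __device__ long long operator*() const { return idx; }
    __device__ iterator& operator++() { idx += step; return *this; }
    __device__ bool operator!=(iterator const& other) const { return idx < other.idx; }
  };
  long long size;
  __device__ iterator begin() const
  {
    auto const tid = blockIdx.x * static_cast<long long>(blockDim.x) + threadIdx.x;
    return iterator{tid, static_cast<long long>(blockDim.x) * gridDim.x};
  }
  __device__ iterator end() const { return iterator{size, 0}; }
};

// Usage inside a kernel body:
//   for (auto i : grid_stride_range{output_size}) { out[i] = in[i]; }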

harrism (Member) commented Mar 15, 2022

> The fix in basically all of these cases is quite simple: just make the index a size_t

I think the general approach should be:

  1. Use an algorithm (thrust:: or std::) if possible before ever writing a custom kernel -- this way you write a per-element device functor instead, and indexing is handled for you (see the sketch below).
  2. If a custom kernel must be written, it should use device-side algorithms instead of raw loops.
  3. If a raw grid-stride loop is required and an existing algorithm won't work, we should provide utilities to abstract the iteration and/or the range, and use auto for the type to avoid these mistakes.
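As a hedged illustration of option 1 (not code from this issue), a trivial per-element operation written with a Thrust algorithm, so there is no hand-written grid-stride loop left to overflow:

#include <cstddef>
#include <thrust/execution_policy.h>
#include <thrust/transform.h>

// Sketch only: the per-element work lives in a device functor and Thrust
// owns the iteration and indexing.
struct negate_fn {
  __device__ int operator()(int x) const { return -x; }
};

void negate_column(int const* d_in, int* d_out, std::size_t size)
{
  thrust::transform(thrust::device, d_in, d_in + size, d_out, negate_fn{});
}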

rapids-bot bot pushed a commit that referenced this issue Mar 23, 2022
Partially addresses  #10368

Specifically:
- `valid_if`
- `scatter`
- `rolling_window`
- `compute_column_kernel` (ast stuff)
- `replace_nulls` (fixed-width and strings)

The majority of the fixes are simply making the indexing variable a `std::size_t` instead of a `cudf::size_type`, although scatter also had an additional place where it was overflowing, outside the kernel.

I didn't add tests for these fixes, but each of them was individually tested locally to make sure it actually manifested the issue and was then verified with the fix.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Mark Harris (https://github.com/harrism)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10448
github-actions:
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

nvdbaranec (Contributor, Author):
Still relevant.

harrism (Member) commented Apr 26, 2022

Created https://github.com/harrism/ranger as a solution to this. Needs to be moved into libcudf.

github-actions:
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

nvdbaranec (Contributor, Author):
Still relevant.

github-actions:
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

nvdbaranec (Contributor, Author):
Still relevant.

GregoryKimball added the libcudf (Affects libcudf (C++/CUDA) code) label and removed the Needs Triage (Need team to review and classify) label on Jun 28, 2022
github-actions:
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

GregoryKimball (Contributor):
Thanks @harrism for creating the ranger repo! Do you think we are ready to kick off an integration with libcudf, or does ranger need more development first?

wence- (Contributor) commented Aug 10, 2023

Regarding attempts to find locations where this might be happening: in host code, clang and gcc will warn if you add -Wsign-conversion (not covered by -Wall -Wextra) under some circumstances. Unfortunately, there is no such option for nvcc.

#include <cstdint>
int what(int upper)
{
  int i = 0; // no warning if this is a std::int64_t
  unsigned int stride = 10;
  while (i < upper) {
    i = i + stride; // clang warns for this, so does gcc
  }
  i = 0;
  while (i < upper) {
    i += stride; // gcc warns for this, clang does not.
  }
  return i;
}

rapids-bot bot pushed a commit that referenced this issue Sep 1, 2023
This PR adds `grid_1d::grid_stride()` and uses it in a handful of kernels. Follow-up to #13910, which added a `grid_1d::global_thread_id()`. We'll need to do a later PR that catches any missing instances where this should be used, since there are a large number of PRs in flight touching thread indexing code in various files. See #10368.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #13996
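Tying the grid_1d helpers described in the PR above back to the pattern in the issue description, a hedged sketch of the original loop rewritten with them (assuming they live in cudf::detail and return a 64-bit thread_index_type):

auto output_index = cudf::detail::grid_1d::global_thread_id();
auto const stride = cudf::detail::grid_1d::grid_stride();
while (output_index < output_size) {
  // ... process the element at static_cast<size_type>(output_index) ...
  output_index += stride;
}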
rapids-bot bot pushed a commit that referenced this issue Sep 7, 2023
…joins (#13971)

See #10368 (and more recently #13771).

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - David Wendt (https://github.com/davidwendt)

URL: #13971
GregoryKimball removed the status in libcudf on Oct 26, 2023
GregoryKimball moved this to To be revisited in libcudf on Oct 26, 2023
vyasr removed the helps: Dask label on Feb 23, 2024
rapids-bot bot pushed a commit that referenced this issue May 7, 2024
This PR refactors a few kernels to use `thread_index_type` and associated utilities. I started this before realizing how much scope was still left in issue #10368 ("Part 2 - Take another pass over more challenging kernels"), and then I stopped working on this due to time constraints. For the moment, I hope this PR makes a small dent in the number of remaining kernels to convert to using `thread_index_type`.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Mark Harris (https://github.com/harrism)
  - David Wendt (https://github.com/davidwendt)

URL: #14107
wence- added a commit to wence-/cudf that referenced this issue Nov 18, 2024
For large columns, the computed stride might end up overflowing
size_type. To fix this, use the grid_1d helper. See also rapidsai#10368.

- Closes rapidsai#17353
GregoryKimball changed the title from "Use "ranger" to prevent grid stride loop overflow" to "Prevent grid stride loop overflow in libcudf kernels" on Nov 19, 2024
rapids-bot bot pushed a commit that referenced this issue Nov 19, 2024
For large columns, the computed stride might end up overflowing size_type. To fix this, use the grid_1d helper. See also #10368.

- Closes #17353

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Nghia Truong (https://github.com/ttnghia)

URL: #17354
rapids-bot bot pushed a commit that referenced this issue Nov 26, 2024
…culations (#17404)

Replaces `threadIdx.x + blockDim.x * blockIdx.x` logic with `grid_1d::global_thread_id()` and `blockDim.x * gridDim.x` with `grid_1d::grid_stride()` in libcudf strings and text source.

Reference #10368

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #17404
GregoryKimball moved this from To be revisited to In progress in libcudf on Dec 2, 2024
rapids-bot bot pushed a commit that referenced this issue Dec 11, 2024
Fixes some possible thread-id calculations or usages that may possibly overflow `int32` type or `size_type`.
Reference #10368

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Shruti Shivakumar (https://github.com/shrshi)

URL: #17473
rapids-bot bot pushed a commit that referenced this issue Dec 20, 2024
Fixes possible integer overflow condition when the number of rows is near max int32 in `compute_mixed_join_output_size` kernel function.
Reference #10368

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #17633
Projects: libcudf (Status: In progress)
6 participants