Prevent grid stride loop overflow in libcudf kernels #10368
Comments
The fix in basically all of these cases is quite simple: just make the index a `std::size_t`.
I'd love to just add an algorithm to do this.
Or even a simple range helper for grid-stride loops.
I think the general approach should be:
Partially addresses #10368 Specifically: - `valid_if` - `scatter` - `rolling_window` - `compute_column_kernel` (ast stuff) - `replace_nulls` (fixed-width and strings) The majority of the fixes simply make the indexing variable a `std::size_t` instead of a `cudf::size_type`, although `scatter` also overflowed in one additional place outside the kernel. I didn't add tests for these fixes, but each was individually tested locally to confirm it manifested the issue and then verified with the fix. Authors: - https://github.com/nvdbaranec Approvers: - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) - Mark Harris (https://github.com/harrism) - Nghia Truong (https://github.com/ttnghia) URL: #10448
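For illustration, a minimal before/after sketch of that fix pattern (the kernel here is hypothetical, not one of the kernels listed above): widening the loop index to 64 bits means adding the stride can no longer wrap a 32-bit value.

```c++
#include <cstddef>

#include <cudf/types.hpp>

__global__ void copy_kernel(int const* in, int* out, cudf::size_type size)
{
  // Before: cudf::size_type idx = threadIdx.x + blockIdx.x * blockDim.x;
  // which is signed 32-bit, so adding the stride can wrap past INT32_MAX.
  std::size_t idx = threadIdx.x + blockIdx.x * static_cast<std::size_t>(blockDim.x);
  auto const stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  while (idx < static_cast<std::size_t>(size)) {
    out[idx] = in[idx];
    idx += stride;  // 64-bit arithmetic: no wraparound
  }
}
```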
This issue has been labeled inactive.
Still relevant.
Created https://github.com/harrism/ranger as a solution to this. Needs to be moved into libcudf.
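For context, a range helper of this sort can be quite small. A minimal sketch of the idea (illustrative only; `grid_stride_range` here is a local type and not necessarily ranger's actual API):

```c++
#include <cstdint>

// Minimal grid-stride "range" usable in a device-side range-based for loop.
// Iterates i = global thread id, then i += grid size, while i < n, in 64 bits.
struct grid_stride_range {
  std::int64_t n;

  struct iterator {
    std::int64_t i;
    std::int64_t stride;
    __device__ std::int64_t operator*() const { return i; }
    __device__ iterator& operator++() { i += stride; return *this; }
    __device__ bool operator!=(iterator const& end) const { return i < end.i; }
  };

  __device__ iterator begin() const
  {
    auto const tid = threadIdx.x + std::int64_t{blockIdx.x} * blockDim.x;
    return {tid, std::int64_t{blockDim.x} * gridDim.x};
  }
  __device__ iterator end() const { return {n, 0}; }
};

__global__ void fill_kernel(int* out, std::int64_t n)
{
  for (auto i : grid_stride_range{n}) { out[i] = 42; }
}
```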
This issue has been labeled inactive.
Still relevant.
This issue has been labeled inactive.
Still relevant.
This issue has been labeled inactive.
Regarding attempts to find locations where this might be happening: in host code, clang and gcc will warn if you enable conversion warnings (e.g., `-Wconversion`):

```c++
#include <cstdint>

int what(int upper)
{
  int i = 0;  // no warning below if this is a std::int64_t
  unsigned int stride = 10;
  while (i < upper) {
    i = i + stride;  // clang warns for this, so does gcc
  }
  i = 0;
  while (i < upper) {
    i += stride;  // gcc warns for this, clang does not
  }
  return i;
}
```
This PR adds `grid_1d::grid_stride()` and uses it in a handful of kernels. Follow-up to #13910, which added a `grid_1d::global_thread_id()`. We'll need to do a later PR that catches any missing instances where this should be used, since there are a large number of PRs in flight touching thread indexing code in various files. See #10368. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Vyas Ramasubramani (https://github.com/vyasr) URL: #13996
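A sketch of the resulting pattern (assuming libcudf's `cudf::detail::grid_1d` helpers from `cudf/detail/utilities/cuda.cuh`; the kernel itself is illustrative):

```c++
#include <cudf/detail/utilities/cuda.cuh>
#include <cudf/types.hpp>

__global__ void strided_copy_kernel(int const* in, int* out, cudf::size_type size)
{
  // Both helpers return cudf::thread_index_type (a 64-bit integer),
  // so the index and stride arithmetic cannot overflow int32.
  auto tid          = cudf::detail::grid_1d::global_thread_id();
  auto const stride = cudf::detail::grid_1d::grid_stride();
  while (tid < size) {
    out[tid] = in[tid];
    tid += stride;
  }
}
```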
…joins (#13971) See #10368 (and more recently #13771). Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - David Wendt (https://github.com/davidwendt) URL: #13971
This PR refactors a few kernels to use `thread_index_type` and associated utilities. I started this before realizing how much scope was still left in issue #10368 ("Part 2 - Take another pass over more challenging kernels"), and then I stopped working on this due to time constraints. For the moment, I hope this PR makes a small dent in the number of remaining kernels to convert to using `thread_index_type`. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - MithunR (https://github.com/mythrocks) - Mark Harris (https://github.com/harrism) - David Wendt (https://github.com/davidwendt) URL: #14107
For large columns, the computed stride might end up overflowing size_type. To fix this, use the grid_1d helper. See also #10368. - Closes #17353 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Bradley Dice (https://github.com/bdice) - David Wendt (https://github.com/davidwendt) - Tianyu Liu (https://github.com/kingcrimsontianyu) - Muhammad Haseeb (https://github.com/mhaseeb123) - Nghia Truong (https://github.com/ttnghia) URL: #17354
…culations (#17404) Replaces `threadIdx.x + blockDim.x * blockIdx.x` logic with `grid_1d::global_thread_id()` and `blockDim.x * gridDim.x` with `grid_1d::grid_stride()` in libcudf strings and text source. Reference #10368 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - Muhammad Haseeb (https://github.com/mhaseeb123) URL: #17404
Fixes some thread-id calculations and usages that may overflow the `int32` type or `size_type`. Reference #10368 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Shruti Shivakumar (https://github.com/shrshi) URL: #17473
Fixes possible integer overflow condition when the number of rows is near max int32 in `compute_mixed_join_output_size` kernel function. Reference #10368 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Vukasin Milovanovic (https://github.com/vuule) URL: #17633
(updated Aug 2023; updated Nov 2024)
Background
We found a kernel indexing overflow issue, first discovered in the `fused_concatenate` kernels (#10333), and it is present in a number of our CUDA kernels that take the form sketched below. If we have an `output_size` of, say, 1.2 billion and a grid size that's the same, the following happens: some late thread id, say 1.19 billion, attempts to add 1.2 billion (`blockDim.x * gridDim.x`) and overflows the `size_type` (signed 32 bits).
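The original code sample did not survive extraction; this is a reconstruction of the pattern in question (a hedged sketch, with an illustrative kernel body):

```c++
#include <cudf/types.hpp>

__global__ void kernel(int const* input, int* output, cudf::size_type output_size)
{
  // cudf::size_type is a signed 32-bit integer
  cudf::size_type output_index = threadIdx.x + blockIdx.x * blockDim.x;
  while (output_index < output_size) {
    output[output_index] = input[output_index];  // placeholder for real work
    output_index += blockDim.x * gridDim.x;      // late thread ids overflow here
  }
}
```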
We made a round of fixes in #10448, and then later found another instance of this error in #13838. Our first pass of investigation was not adequate to contain the issue, so we need to take another close look.
Part 1 - First pass: fix kernels with this issue
- `copying/concatenate.cu` : `fused_concatenate_kernel`
- `valid_if.cuh` : `valid_if_kernel`
- `scatter.cu` : `marking_bitmask_kernel`
- `replace/nulls.cu` : `replace_nulls_strings`
- `replace/nulls.cu` : `replace_nulls`
- `rolling/rolling_detail.cuh` : `gpu_rolling`
- `rolling/jit/kernel.cu` : `gpu_rolling_new`
- `transform/compute_column.cu` : `compute_column_kernel`
- `copying/concatenate.cu` : `fused_concatenate_string_offset_kernel`
- `replace/replace.cu` : `replace_strings_first_pass`, `replace_strings_second_pass`, `replace_kernel`
- `copying/concatenate.cu` : `concatenate_masks_kernel`, `fused_concatenate_string_offset_kernel`, `fused_concatenate_string_chars_kernel`, `fused_concatenate_kernel` (int64)
- `hash/helper_functions.cuh` : `init_hashtbl`
- `null_mask.cu` : `set_null_mask_kernel`, `copy_offset_bitmask`, `count_set_bits_kernel`
- `transform/row_bit_count.cu` : `compute_row_sizes`
- `multibyte_split.cu` : `multibyte_split_init_kernel`, `multibyte_split_seed_kernel` (auto??), `multibyte_split_kernel`
- `io/utilities/parsing_utils.cu` : `count_and_set_positions` (uint64_t)
- `conditional_join_kernels.cuh` : `compute_conditional_join_output_size`, `conditional_join`
- `merge.cu` : `materialize_merged_bitmask_kernel`
- `partitioning.cu` : `compute_row_partition_numbers`, `compute_row_output_locations`, `copy_block_partitions`
- `json_path.cu` : `get_json_object_kernel`
- `tdigest` : `compute_percentiles_kernel` (int)
- `strings/attributes.cu` : `count_characters_parallel_fn`
- `strings/convert/convert_urls.cu` : `url_decode_char_counter` (int), `url_decode_char_replacer` (int)
- `text/subword/data_normalizer.cu` : `kernel_data_normalizer` (uint32_t)
- `text/subword/subword_tokenize.cu` : `kernel_compute_tensor_metadata` (uint32_t)
- `text/subword/wordpiece_tokenizer.cu` : `init_data_and_mark_word_start_and_ends` (uint32_t), `mark_string_start_and_ends` (uint32_t), `kernel_wordpiece_tokenizer` (uint32_t)

Part 2 - Take another pass over more challenging kernels
Search for `gridDim.x` or `blockDim.x` to find more examples.

Part 2b - results of searching
```
cpp/src/binaryop/compiled/binary_ops.cuh
cpp/src/binaryop/jit/kernel.cu
include/cudf/detail/copy_if.cuh: int tid = threadIdx.x + per_thread * block_size * blockIdx.x;
include/cudf/detail/copy_if.cuh: int tid = threadIdx.x + per_thread * block_size * blockIdx.x;
include/cudf/detail/copy_if.cuh: int const wid = threadIdx.x / cudf::detail::warp_size;
include/cudf/detail/copy_if.cuh: int const lane = threadIdx.x % cudf::detail::warp_size;
include/cudf/detail/copy_if_else.cuh: int const lane_id = threadIdx.x % cudf::detail::warp_size;
include/cudf/detail/copy_range.cuh: int const lane_id = threadIdx.x % warp_size;
include/cudf/detail/copy_range.cuh: cudf::size_type const tid = threadIdx.x + blockIdx.x * blockDim.x;
include/cudf/detail/copy_range.cuh: source_idx += blockDim.x * gridDim.x;
include/cudf/detail/null_mask.cuh: auto const tid = threadIdx.x + blockIdx.x * blockDim.x;
include/cudf/detail/null_mask.cuh: destination_word_index += blockDim.x * gridDim.x) {
include/cudf/detail/null_mask.cuh: size_type const tid = threadIdx.x + blockIdx.x * blockDim.x;
include/cudf/detail/null_mask.cuh: range_id += blockDim.x * gridDim.x;
include/cudf/detail/valid_if.cuh: auto block_offset = blockIdx.x * blockDim.x;
include/cudf/detail/valid_if.cuh: auto const thread_idx = block_offset + threadIdx.x;
include/cudf/detail/valid_if.cuh: block_offset += blockDim.x * gridDim.x;
include/cudf/strings/detail/gather.cuh: int global_thread_id = blockIdx.x * blockDim.x + threadIdx.x;
include/cudf/strings/detail/gather.cuh: int nwarps = gridDim.x * blockDim.x / cudf::detail::warp_size;
include/cudf/strings/detail/gather.cuh: size_type begin_out_string_idx = blockIdx.x * strings_per_threadblock;
src/join/mixed_join_kernel.cuh: cudf::size_type outer_row_index = threadIdx.x + blockIdx.x * block_size;
src/join/mixed_join_size_kernel.cuh: cudf::size_type const start_idx = threadIdx.x + blockIdx.x * block_size;
src/partitioning/partitioning.cu: size_type partition_number = threadIdx.x;
src/partitioning/partitioning.cu: size_type const write_location = partition_number * gridDim.x + blockIdx.x;
src/partitioning/partitioning.cu: block_partition_offsets[partition_number * gridDim.x + blockIdx.x];
src/partitioning/partitioning.cu: if (ELEMENTS_PER_THREAD * threadIdx.x + i < num_partitions) {
src/partitioning/partitioning.cu: block_partition_sizes[blockIdx.x + (ELEMENTS_PER_THREAD * threadIdx.x + i) * gridDim.x];
src/partitioning/partitioning.cu: if (ELEMENTS_PER_THREAD * threadIdx.x + i < num_partitions) {
src/partitioning/partitioning.cu: scanned_block_partition_sizes[ipartition * gridDim.x + blockIdx.x];
src/strings/regex/utilities.cuh: auto const thread_idx = threadIdx.x + blockIdx.x * blockDim.x;
src/strings/regex/utilities.cuh: auto const thread_idx = threadIdx.x + blockIdx.x * blockDim.x;
src/strings/search/find.cu: size_type const idx = static_cast<size_type>(threadIdx.x + blockIdx.x * blockDim.x);
src/strings/search/find.cu: size_type const idx = static_cast<size_type>(threadIdx.x + blockIdx.x * blockDim.x);
src/text/subword/data_normalizer.cu: uint32_t* block_base = code_points + blockIdx.x * blockDim.x * MAX_NEW_CHARS;
src/text/vocabulary_tokenize.cu: auto const idx = static_cast<std::size_t>(threadIdx.x + blockIdx.x * blockDim.x);
src/transform/compute_column.cu: &intermediate_storage[threadIdx.x * device_expression_data.num_intermediates];
src/transform/jit/kernel.cu: thread_index_type const start = threadIdx.x + blockIdx.x * blockDim.x;
src/transform/jit/kernel.cu: thread_index_type const stride = blockDim.x * gridDim.x;
src/transform/row_bit_count.cu: int const tid = threadIdx.x + blockIdx.x * blockDim.x;
src/transform/row_bit_count.cu: row_span* my_branch_stack = thread_branch_stacks + (threadIdx.x * max_branch_depth);
```
Additional information
There are also a number of kernels that have this pattern but probably never overflow because they index by bitmask words: dividing the row count by 32 keeps the index well below the 32-bit limit. (Example)
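A quick host-side check of that bound (a sketch; the at-most-INT32_MAX row count and 32-bit word size are the assumptions here):

```c++
#include <cstdint>

// A column holds at most INT32_MAX rows; bitmask kernels index 32-bit words.
constexpr std::int64_t max_rows  = INT32_MAX;            // 2,147,483,647
constexpr std::int64_t max_words = (max_rows + 31) / 32; // ~67.1 million

// Even index + stride stays around 2 * max_words, far below INT32_MAX,
// so a 32-bit word index cannot overflow in these kernels.
static_assert(2 * max_words < INT32_MAX, "word indexing stays in range");
```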
Additionally, in this kernel, `source_idx` probably overflows, but harmlessly.

A snippet of code to see this in action:
Note: rmm may mask out of bounds accesses in some cases, so it's helpful to run with the plain cuda allocator.
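The snippet itself did not survive here; as a stand-in, a small host-side illustration of the failing arithmetic (the addition is done in 64 bits to show the 32-bit result would wrap, since actual signed overflow is undefined behavior):

```c++
#include <cstdint>
#include <iostream>

int main()
{
  std::int32_t const output_size = 1'200'000'000;  // elements in the column
  std::int32_t const stride      = output_size;    // blockDim.x * gridDim.x
  std::int32_t const late_thread = 1'190'000'000;  // a "late" thread id

  // 1.19e9 + 1.2e9 = 2.39e9 > INT32_MAX (~2.147e9)
  std::int64_t const next = static_cast<std::int64_t>(late_thread) + stride;
  std::cout << "next index: " << next
            << " (overflows int32: " << (next > INT32_MAX) << ")\n";
  return 0;
}
```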