Add cudf::strings::contains_multiple #16900

Merged
merged 86 commits into from
Nov 12, 2024

Conversation

davidwendt
Contributor

Description

Adds a new cudf::strings::contains_multiple API that searches each row of a strings column for multiple target strings.
The output is a table with one boolean column per target; each row of a column indicates whether that target was found in the corresponding input row.
This PR supports the work in #16641.
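
For readers skimming the thread, the output shape described above can be sketched on the host with plain standard-library C++. This is only a reference model of the semantics, not the libcudf implementation or its signature: the names `contains_multiple_reference` and `contains_multiple_example` are hypothetical, the data lives in `std::vector` rather than cudf columns, and null rows are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side reference, NOT the libcudf implementation.
// Returns one boolean column per target: result[t][r] is true when
// targets[t] occurs as a substring of input[r]. Nulls are not modeled.
std::vector<std::vector<bool>> contains_multiple_reference(
  std::vector<std::string> const& input, std::vector<std::string> const& targets)
{
  std::vector<std::vector<bool>> columns;
  columns.reserve(targets.size());
  for (auto const& target : targets) {
    std::vector<bool> column(input.size(), false);
    for (std::size_t row = 0; row < input.size(); ++row) {
      column[row] = input[row].find(target) != std::string::npos;
    }
    columns.push_back(std::move(column));
  }
  return columns;
}

// Usage: three input rows and two targets give two columns of three rows each.
void contains_multiple_example()
{
  std::vector<std::string> const input{"apple pie", "banana", "pineapple"};
  std::vector<std::string> const targets{"apple", "an"};
  auto const result = contains_multiple_reference(input, targets);
  assert(result.size() == 2);
  assert((result[0] == std::vector<bool>{true, false, true}));
  assert((result[1] == std::vector<bool>{false, true, false}));
}
```

The sketch loops over targets in the outer loop purely to mirror the column-per-target layout; it says nothing about how the GPU kernel divides the work.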

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Sep 24, 2024
@davidwendt davidwendt self-assigned this Sep 24, 2024
@github-actions github-actions bot added the CMake CMake build issue label Sep 24, 2024
@davidwendt davidwendt mentioned this pull request Sep 30, 2024
3 tasks
@Matt711 Matt711 assigned Matt711 and unassigned Matt711 Sep 30, 2024
@res-life
Contributor

res-life commented Nov 8, 2024

This PR performs better than the previous PR.
The nsys kernel summary and end-to-end time for each are below:

previous PR:

Time	Total Time	Instances	Avg	Med	Min	Max	StdDev	Name
27.0%	19.388 s	792	24.480 ms	18.258 ms	2.157 ms	79.471 ms	22.218 ms	cudf::strings::detail::<unnamed>::multi_contains_warp_parallel_multi_scalars_fn(cudf::column_device_view, cudf::column_device_view, cudf::device_span<char, (unsigned long)18446744073709551615>, cudf::column_device_view, cudf::device_span<bool *, (unsigned long)18446744073709551615>)

End-to-end time: 53s

this PR:

Time	Total Time	Instances	Avg	Med	Min	Max	StdDev	Name
16.9%	11.033 s	528	20.896 ms	14.290 ms	1.808 ms	63.529 ms	18.714 ms	void cudf::strings::detail::<unnamed>::multi_contains_kernel<(long)32>(cudf::column_device_view, cudf::column_device_view, const unsigned char *, const int *, const int *, int, bool *, cudf::device_span<bool *, (unsigned long)18446744073709551615>)

End-to-end time: 46s
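
As a rough reading aid, the relative improvements implied by the two measurements can be worked out with a small helper. The function name `percent_reduction` is hypothetical, and the figures are derived only from the numbers posted above: about a 43.1% reduction in total kernel time, about 14.6% in average time per kernel instance, and about 13.2% end to end. The instance counts differ between the two runs (792 vs 528), so part of the total-time reduction may simply reflect fewer kernel launches.

```cpp
#include <cassert>

// Hypothetical helper: relative reduction, in percent, going from
// `before` to `after`. Both values must be in the same unit.
double percent_reduction(double before, double after)
{
  assert(before > 0.0);
  return (before - after) / before * 100.0;
}

// Inputs taken from the nsys rows and end-to-end times in this comment:
//   total kernel time:  percent_reduction(19.388, 11.033)  // seconds
//   average per kernel: percent_reduction(24.480, 20.896)  // milliseconds
//   end-to-end time:    percent_reduction(53.0, 46.0)      // seconds
```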

@davidwendt davidwendt requested a review from mythrocks November 8, 2024 21:19
@mythrocks
Contributor

Sorry for the delay. It took a while to get my head around it. Thank you, this is an impressive speedup.

@res-life
Contributor

@davidwendt Other PRs depend on this one. Could you merge it?

@davidwendt
Contributor Author

/merge

@res-life
Contributor

/merge

Contributor

@kingcrimsontianyu kingcrimsontianyu left a comment

Lgtm. Thanks!

Contributor

@bdice bdice left a comment

I will apply one small docstring fix I saw while reviewing C++, then I can approve the CMake (I didn't want to approve yet because it would auto-merge since /merge has already been added).

cpp/include/cudf/strings/find_multiple.hpp (outdated, resolved)
@rapids-bot rapids-bot bot merged commit 796de4b into rapidsai:branch-24.12 Nov 12, 2024
102 checks passed
@davidwendt davidwendt deleted the contains-multiple branch November 12, 2024 17:32
rapids-bot bot pushed a commit that referenced this pull request Nov 14, 2024
This is the Java JNI interface for the [multiple contains PR](#16900).

Authors:
  - Chong Gao (https://github.com/res-life)

Approvers:
  - Alessandro Bellina (https://github.com/abellina)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #17281
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
Status: Landed

8 participants