Add string.find_multiple APIs to pylibcudf #16824

mroeschke
Contributor

Description

Contributes to #15162
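
For readers unfamiliar with the operation, here is a rough pure-Python sketch of what a find-multiple search computes: for each input string, the character position of each target, or -1 when the target is absent (my understanding of the libcudf behavior). The name `find_multiple_sketch` and its list-based inputs are hypothetical and only illustrate the semantics; this is not the pylibcudf signature added by this PR.

```python
def find_multiple_sketch(strings, targets):
    """Hypothetical illustration only, not the pylibcudf API.

    Returns one list per input string holding the first character
    position of each target, or -1 if that target does not occur.
    """
    # str.find already returns -1 when the substring is missing.
    return [[s.find(t) for t in targets] for s in strings]


positions = find_multiple_sketch(["abcdef", "xyz"], ["cd", "z", "q"])
```

Null handling is left out of the sketch on purpose, since the description does not say how null rows are treated.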

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package labels Sep 18, 2024
@mroeschke mroeschke requested a review from a team as a code owner September 18, 2024 00:22
@mroeschke mroeschke requested review from isVoid and Matt711 September 18, 2024 00:22
@github-actions github-actions bot added Python Affects Python cuDF API. CMake CMake build issue and removed pylibcudf Issues specific to the pylibcudf package labels Sep 18, 2024
@Matt711 Matt711 added the pylibcudf Issues specific to the pylibcudf package label Sep 19, 2024
…i#16886)

Because polars releases much more frequently than RAPIDS, we propose hard-pinning releases to a known-good version, in this case 1.8.x.

At the same time, remove the pin in CI scripts and update the list of xfailing tests in the polars test suite.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - James Lamb (https://github.com/jameslamb)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai#16886
@Matt711 Matt711 changed the base branch from branch-24.10 to branch-24.12 September 25, 2024 17:43
brandon-b-miller and others added 8 commits September 25, 2024 17:52
This PR adds `cudf-polars` to the top level build script.

Authors:
  - https://github.com/brandon-b-miller
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Jake Awe (https://github.com/AyodeAwe)

URL: rapidsai#16898
Previously, when `columns=` was a `cudf.Series` or `cudf.Index`, we would call `return array.unique.to_pandas()`, but `.unique` is a method, not a property, so this would have raised an error.

Also took the time to refactor the helper methods here and push down the `errors=` keyword to `Frame._drop_column`.
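
A minimal sketch of the `.unique` failure mode described above, using a hypothetical stand-in class `FakeColumn` rather than the real cuDF objects. Its `to_pandas` just returns a plain list so the example runs without a GPU. It only demonstrates why attribute access on an uncalled method raises.

```python
class FakeColumn:
    """Hypothetical stand-in for a cudf.Series/Index (not the real class)."""

    def __init__(self, values):
        self._values = list(values)

    def unique(self):
        # dict.fromkeys drops duplicates while keeping first-seen order.
        return FakeColumn(dict.fromkeys(self._values))

    def to_pandas(self):
        # Placeholder: returns a plain list instead of a pandas object.
        return list(self._values)


col = FakeColumn(["a", "b", "a"])

error = None
try:
    # `col.unique` is a bound method object, which has no `.to_pandas`.
    col.unique.to_pandas()
except AttributeError as exc:
    error = exc

# Calling the method first gives the intended result.
fixed = col.unique().to_pandas()
```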

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#16712
This PR is a first pass at rapidsai#15937. We will close rapidsai#15937 after rapidsai#15162 is closed

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: rapidsai#16810
Fixes rapidsai#16625

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation. Previously, it did a linear march over the individual tdigests per group and merged them one by one, which led to terrible performance for large numbers of groups. In principle, though, all this was really doing was a segmented sort of centroid values, so that is what this PR changes it to. The speedup for 1,000,000 input tdigests with 1,000,000 individual groups is ~1000x:

```
Old
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
```


```
New
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
```
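
To illustrate why the two approaches are interchangeable, here is a CPU-side sketch in pure Python, not the libcudf CUDA code. It assumes each tdigest is a list of `(mean, weight)` centroids already sorted by mean. The function names and the dictionary layout are hypothetical.

```python
from heapq import merge
from itertools import groupby


def merge_one_by_one(digests_by_group):
    """Old-style sketch: fold each group's sorted digests together one at a time."""
    out = {}
    for group, digests in digests_by_group.items():
        merged = []
        for digest in digests:
            # Each step re-walks everything merged so far for this group.
            merged = list(merge(merged, digest))
        out[group] = merged
    return out


def segmented_sort(digests_by_group):
    """New-style sketch: flatten all centroids and sort once by (group, mean, weight)."""
    flat = [
        (group, mean, weight)
        for group, digests in digests_by_group.items()
        for digest in digests
        for mean, weight in digest
    ]
    flat.sort()  # the group key comes first, so each segment stays contiguous
    return {
        group: [(mean, weight) for _, mean, weight in rows]
        for group, rows in groupby(flat, key=lambda row: row[0])
    }


data = {
    0: [[(1.0, 1), (5.0, 2)], [(3.0, 1)]],
    1: [[(2.0, 4)], [(0.5, 1), (9.0, 1)]],
}
same = merge_one_by_one(data) == segmented_sort(data)
```

On a GPU the single sort can be done as one segmented sort across all groups instead of a sequential per-group loop, which is consistent with the speedup reported above.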

Authors:
  - https://github.com/nvdbaranec
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#16780
This PR displays deltas for CPU and GPU usage metrics that are extracted from `cudf.pandas` pytests.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)

URL: rapidsai#16864
…apidsai#15979)

Part of rapidsai#15903.
1. Introduces the Compressed Sparse Row (CSR) format to store the adjacency information of the column tree. 
2. Analogous to `reduce_to_column_tree`, `reduce_to_column_tree_csr` reduces node tree representation to column tree stored in CSR format.
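
As a rough illustration of the CSR layout mentioned above, the sketch below stores a small tree's parent-to-children adjacency as a row-offsets array plus a column-indices array. It is a host-side Python sketch under stated assumptions, not the `reduce_to_column_tree_csr` implementation. The parent-array input, the directed (children-only) adjacency, and the function names are all hypothetical.

```python
def tree_to_csr(parent_ids):
    """Build (row_offsets, col_indices) from a parent array; the root has parent -1."""
    n = len(parent_ids)

    # Count the children of every node.
    child_counts = [0] * n
    for parent in parent_ids:
        if parent >= 0:
            child_counts[parent] += 1

    # Exclusive prefix sum of the counts gives each node's slice start.
    row_offsets = [0] * (n + 1)
    for node in range(n):
        row_offsets[node + 1] = row_offsets[node] + child_counts[node]

    # Scatter each child into its parent's slice.
    col_indices = [0] * row_offsets[n]
    cursor = row_offsets[:n]  # next free slot per parent (a copy)
    for child, parent in enumerate(parent_ids):
        if parent >= 0:
            col_indices[cursor[parent]] = child
            cursor[parent] += 1
    return row_offsets, col_indices


def children(row_offsets, col_indices, node):
    """Children of `node` are one contiguous slice of col_indices."""
    return col_indices[row_offsets[node]:row_offsets[node + 1]]


# Node 0 is the root; nodes 1 and 2 are its children; 3 and 4 hang off 1; 5 off 2.
offsets, indices = tree_to_csr([-1, 0, 0, 1, 1, 2])
```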

TODO:
- [x] Correctness test

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: rapidsai#15979
This PR is a first pass at making tests deterministic. I noticed that one of the CI jobs failed due to an overflow error related to random data generation.
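
One common way to get deterministic test data, sketched here with the standard library only, is to draw from a locally seeded generator and bound the value range. The helper name `make_test_data`, the seed, and the int32 bound are assumptions for illustration, not what this PR changes.

```python
import random


def make_test_data(seed, n=5, low=0, high=2**31 - 1):
    """Hypothetical helper: reproducible integers within an int32-safe range."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return [rng.randint(low, high) for _ in range(n)]


first = make_test_data(seed=42)
second = make_test_data(seed=42)
```

Because the generator is seeded per call, a failing input can be reproduced exactly by rerunning with the same seed.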

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Richard (Rick) Zamora (https://github.com/rjzamora)

URL: rapidsai#16910
@mroeschke mroeschke requested review from a team as code owners September 25, 2024 21:20
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. cudf.polars Issues specific to cudf.polars labels Sep 25, 2024
@mroeschke mroeschke closed this Sep 25, 2024
@mroeschke mroeschke deleted the pylibcudf/strings/find_multiple branch September 25, 2024 21:31
rapids-bot bot pushed a commit that referenced this pull request Oct 2, 2024
Redo at #16824

Contributes to #15162

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Matthew Murray (https://github.com/Matt711)

URL: #16920
Labels
CMake CMake build issue cudf.polars Issues specific to cudf.polars improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done

7 participants