Release [NIGHTLY] v24.12.00 · rapidsai/cudf

🔗 Links

🚨 Breaking Changes

Fix reading Parquet string cols when nrows and input_pass_limit > 0 (#17321) @mhaseeb123
prefer wheel-provided libcudf.so in load_library(), use RTLD_LOCAL (#17316) @jameslamb
Deprecate single component extraction methods in libcudf (#17221) @Matt711
Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
Refactor Dask cuDF legacy code (#17205) @rjzamora
Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
Remove java reservation (#17189) @revans2
Separate evaluation logic from IR objects in cudf-polars (#17175) @rjzamora
Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
Correctly set is_device_accessible when creating host_spans from other container/span types (#17079) @vuule
Unify treatment of Expr and IR nodes in cudf-polars DSL (#17016) @wence-
Deprecate support for directly accessing logger (#16964) @vyasr
Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
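
As a rough illustration of the load_library() entry above (#17316): the sketch below tries a wheel-provided copy of a shared library first, then falls back to the default loader search path, opening it with RTLD_LOCAL so its symbols stay out of the global namespace. This is not the cuDF implementation. The `wheel_lib_dir` and `loader` parameters are hypothetical names used only for this example.

```python
import ctypes
import os


def load_library(soname, wheel_lib_dir=None, loader=ctypes.CDLL):
    """Load `soname`, preferring a copy shipped inside a wheel.

    `wheel_lib_dir` and `loader` are hypothetical parameters for this sketch;
    they are not part of any cuDF API.
    """
    candidates = []
    if wheel_lib_dir is not None:
        # First choice: the copy bundled with the wheel.
        candidates.append(os.path.join(wheel_lib_dir, soname))
    # Fallback: bare name, so the dynamic loader searches its default paths.
    candidates.append(soname)

    last_error = None
    for path in candidates:
        try:
            # RTLD_LOCAL keeps the library's symbols private to this handle.
            return loader(path, mode=ctypes.RTLD_LOCAL)
        except OSError as exc:
            last_error = exc
    raise last_error
```

Passing the loader in as a parameter keeps the search order testable without a real shared library on disk.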

🐛 Bug Fixes

Ignore errors when testing glibc versions (#17389) @vyasr
Adapt to KvikIO API change in the compatibility mode (#17377) @kingcrimsontianyu
Support pivot with index or column arguments as lists (#17373) @mroeschke
Deselect failing polars tests (#17362) @pentschev
Fix integer overflow in compiled binaryop (#17354) @wence-
Update cmake to 3.28.6 in JNI Dockerfile (#17342) @jlowe
fix library-loading issues in editable installs (#17338) @jameslamb
Bug fix: restrict lines=True to JSON format in Kafka read_gdf method (#17333) @a-hirota
Fix various issues with replace API and add support in datetime and timedelta columns (#17331) @galipremsagar
Do not exclude nanoarrow and flatbuffers from installation if statically linked (#17322) @hyperbolic2346
Fix reading Parquet string cols when nrows and input_pass_limit > 0 (#17321) @mhaseeb123
Remove another reference to FindcuFile (#17315) @KyleFromNVIDIA
Fix reading of single-row unterminated CSV files (#17305) @vuule
Fixed lifetime issue in ast transform tests (#17292) @lamarrr
Switch to using TaskSpec (#17285) @galipremsagar
Fix data_type ctor call in JSON_TEST (#17273) @davidwendt
Expose delimiter character in JSON reader options to JSON reader APIs (#17266) @shrshi
Fix extract-datetime deprecation warning in ndsh benchmark (#17254) @davidwendt
Disallow cuda-python 12.6.1 and 11.8.4 (#17253) @bdice
Wrap custom iterator result (#17251) @galipremsagar
Fix binop with LHS numpy datetimelike scalar (#17226) @mroeschke
Fix Dataframe.__setitem__ slow-downs (#17222) @galipremsagar
Fix groupby.get_group with length-1 tuple with list-like grouper (#17216) @mroeschke
Fix discoverability of submodules inside pd.util (#17215) @galipremsagar
Fix Schema.Builder not propagating precision value to Builder instance (#17214) @ttnghia
Mark column chunks in a PQ reader pass as large strings when the cumulative offsets exceed the large strings threshold (#17207) @mhaseeb123
[BUG] Replace repo_token with github_token in Auto Assign PR GHA (#17203) @Matt711
Remove unsanitized nulls from input strings columns in reduction gtests (#17202) @davidwendt
Fix to_parquet append behavior with global metadata file (#17198) @rjzamora
Check num_children() == 0 in Column.from_column_view (#17193) @cwharris
Fix host-to-device copy missing sync in strings/duration convert (#17149) @davidwendt
Add JNI Support for Multi-line Delimiters and Include Test (#17139) @SurajAralihalli
Ignore loud dask warnings about legacy dataframe implementation (#17137) @galipremsagar
Fix the GDS read/write segfault/bus error when the cuFile policy is set to GDS or ALWAYS (#17122) @kingcrimsontianyu
Fix DataFrame._from_arrays and introduce validations (#17112) @galipremsagar
[Bug] Fix Arrow-FS parquet reader for larger files (#17099) @rjzamora
Fix bug in recovering invalid lines in JSONL inputs (#17098) @shrshi
Reenable huge pages for arrow host copying (#17097) @vyasr
Correctly set is_device_accessible when creating host_spans from other container/span types (#17079) @vuule
Fix ORC reader when using device_read_async while the destination device buffers are not ready (#17074) @ttnghia
Fix regex handling of fixed quantifier with 0 range (#17067) @davidwendt
Limit the number of keys to calculate column sizes and page starts in PQ reader to 1B (#17059) @mhaseeb123
Adding assertion to check for regular JSON inputs of size greater than INT_MAX bytes (#17057) @shrshi
bug fix: use self.ck_consumer in poll method of kafka.py to align with __init__ (#17044) @a-hirota
Disable kvikio remote I/O to avoid openssl dependencies in JNI build (#17026) @pxLi
Fix host_span constructor to correctly copy is_device_accessible (#17020) @vuule
Add pinning for pyarrow in wheels (#17018) @vyasr
Use std::optional for host types (#17015) @robertmaynard
Fix write_json to handle empty string column (#16995) @karthikeyann
Restore export of nvcomp outside of wheel builds (#16988) @KyleFromNVIDIA
Allow melt(var_name=) to be a falsy label (#16981) @mroeschke
Fix astype from tz-aware type to tz-aware type (#16980) @mroeschke
Use libcudf wheel from PR rather than nightly for polars-polars CI test job (#16975) @brandon-b-miller
Fix order-preservation in pandas-compat unsorted groupby (#16942) @wence-
Fix cudf::strings::findall error with empty input (#16928) @davidwendt
Fix JsonLargeReaderTest.MultiBatch use of LIBCUDF_JSON_BATCH_SIZE env var (#16927) @davidwendt
Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16923) @shrshi
Respect groupby.nunique(dropna=False) (#16921) @mroeschke
Update all rmm imports to use pylibrmm/librmm (#16913) @Matt711
Fix order-preservation in cudf-polars groupby (#16907) @wence-
Add a shortcut for when the input clusters are all empty for the tdigest merge (#16897) @jihoonson
Properly handle the mapped and registered regions in memory_mapped_source (#16865) @vuule
Fix performance regression for generate_character_ngrams (#16849) @davidwendt
Fix regex parsing logic handling of nested quantifiers (#16798) @davidwendt
Compute whole column variance using numerically stable approach (#16448) @wence-
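
For context on the last entry above (#16448): the notes do not say which algorithm the change uses. The sketch below shows one well-known numerically stable approach, Welford's single-pass update, as an assumption for illustration only. The function name `stable_variance` is hypothetical.

```python
def stable_variance(values, ddof=1):
    """Single-pass variance using Welford's update (illustrative sketch).

    Avoids the catastrophic cancellation of the naive
    sum(x**2)/n - mean**2 formula when values share a large offset.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count - ddof <= 0:
        return float("nan")  # not enough observations
    return m2 / (count - ddof)
```

With inputs such as `[1e9 + 1, 1e9 + 2, 1e9 + 3, 1e9 + 4]`, the running-mean update keeps the intermediate values small, which is where the naive formula tends to lose precision.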

📖 Documentation

Add documentation for low memory readers (#17314) @btepera
Fix the example in documentation for get_dremel_data() (#17242) @mhaseeb123
Fix some documentation rendering for pylibcudf (#17217) @mroeschke
Move detail header floating_conversion.hpp to detail subdirectory (#17209) @davidwendt
Add TokenizeVocabulary to api docs (#17208) @davidwendt
Add jaccard_index to generated cuDF docs (#17199) @davidwendt
[no ci] Add empty-columns section to the libcudf developer guide (#17183) @davidwendt
Add 2-cpp approvers text to contributing guide [no ci] (#17182) @davidwendt
Changing developer guide int_64_t to int64_t (#17130) @hyperbolic2346
docs: change 'CSV' to 'csv' in python/custreamz/README.md to match kafka.py (#17041) @a-hirota
[DOC] Document limitation using cudf.pandas proxy arrays (#16955) @Matt711
[DOC] Document environment variable for failing on fallback in cudf.pandas (#16932) @Matt711

🚀 New Features

Add version config (#17312) @vyasr
Java JNI for Multiple contains (#17281) @res-life
Add cudf::calendrical_month_sequence to pylibcudf (#17277) @Matt711
Raise errors on specific types of fallback in cudf.pandas (#17268) @Matt711
Add catboost to the third-party integration tests (#17267) @Matt711
Add type stubs for pylibcudf (#17258) @wence-
Use pylibcudf contiguous split APIs in cudf python (#17246) @Matt711
Upgrade nvcomp to 4.1.0.6 (#17201) @bdice
Added Arrow Interop Benchmarks (#17194) @lamarrr
Rewrite Java API Table.readJSON to return the output from libcudf read_json directly (#17180) @ttnghia
Support storing precision of decimal types in Schema class (#17176) @ttnghia
Migrate CSV writer to pylibcudf (#17163) @Matt711
Add compute_shared_memory_aggs used by shared memory groupby (#17162) @PointKernel
Added ast tree to simplify expression lifetime management (#17156) @lamarrr
Add compute_mapping_indices used by shared memory groupby (#17147) @PointKernel
Add remaining datetime APIs to pylibcudf (#17143) @Matt711
Added strings AST vs BINARY_OP benchmarks (#17128) @lamarrr
Use libcudf_exception_handler throughout pylibcudf.libcudf (#17109) @brandon-b-miller
Include timezone file path in error message (#17102) @bdice
Migrate NVText Byte Pair Encoding APIs to pylibcudf (#17101) @Matt711
Migrate NVText Tokenizing APIs to pylibcudf (#17100) @Matt711
Migrate NVtext subword tokenizing APIs to pylibcudf (#17096) @Matt711
Migrate NVText Stemming APIs to pylibcudf (#17085) @Matt711
Migrate NVText Replacing APIs to pylibcudf (#17084) @Matt711
Add IWYU to CI (#17078) @vyasr
cudf-polars string/numeric casting (#17076) @brandon-b-miller
Migrate NVText Normalizing APIs to Pylibcudf (#17072) @Matt711
Migrate remaining nvtext NGrams APIs to pylibcudf (#17070) @Matt711
Add profilers to CUDA 12 conda devcontainers (#17066) @vyasr
Add conda recipe for cudf-polars (#17037) @bdice
Implement batch construction for strings columns (#17035) @ttnghia
Add device aggregators used by shared memory groupby (#17031) @PointKernel
Add optional column_order in JSON reader (#17029) @karthikeyann
Migrate Min Hashing APIs to pylibcudf (#17021) @Matt711
Reorganize cudf_polars expression code (#17014) @brandon-b-miller
Migrate nvtext jaccard API to pylibcudf (#17007) @Matt711
Migrate nvtext generate_ngrams APIs to pylibcudf (#17006) @Matt711
Control whether a file data source memory-maps the file with an environment variable (#17004) @vuule
Switched BINARY_OP Benchmarks from GoogleBench to NVBench (#16963) @lamarrr
[FEA] Report all unsupported operations for a query in cudf.polars (#16960) @Matt711
[FEA] Migrate nvtext/edit_distance APIs to pylibcudf (#16957) @Matt711
Switched AST benchmarks from GoogleBench to NVBench (#16952) @lamarrr
Extend device_scalar to optionally use pinned bounce buffer (#16947) @vuule
Implement cudf-polars chunked parquet reading (#16944) @brandon-b-miller
Expose streams in public round APIs (#16925) @Matt711
add telemetry setup to test (#16924) @msarahan
Add cudf::strings::contains_multiple (#16900) @davidwendt
Made cudftestutil header-only and removed GTest dependency (#16839) @lamarrr
Add an example to demonstrate multithreaded read_parquet pipelines (#16828) @mhaseeb123
Implement extract_datetime_component in libcudf/pylibcudf (#16776) @brandon-b-miller
Add cudf::strings::find_re API (#16742) @davidwendt
Migrate hashing operations to pylibcudf (#15418) @brandon-b-miller

🛠️ Improvements

Add pynvml as a dependency for dask-cudf (#17386) @pentschev
Enable unified memory by default in cudf_polars (#17375) @galipremsagar
Support polars 1.14 (#17355) @wence-
Remove cudf._lib.quantiles in favor of inlining pylibcudf (#17347) @mroeschke
Remove cudf._lib.labeling in favor of inlining pylibcudf (#17346) @mroeschke
Remove cudf._lib.hash in favor of inlining pylibcudf (#17345) @mroeschke
Remove cudf._lib.concat in favor of inlining pylibcudf (#17344) @mroeschke
Extract GPUEngine config options at translation time (#17339) @rjzamora
Update java datetime APIs to match CUDF. (#17329) @revans2
Move strings url_decode benchmarks to nvbench (#17328) @davidwendt
Move strings translate benchmarks to nvbench (#17325) @davidwendt
Writing compressed output using JSON writer (#17323) @shrshi
Test the full matrix for polars and dask wheels on nightlies (#17320) @vyasr
Remove cudf._lib.avro in favor of inlining pylibcudf (#17319) @mroeschke
Move cudf._lib.unary to cudf.core._internals (#17318) @mroeschke
prefer wheel-provided libcudf.so in load_library(), use RTLD_LOCAL (#17316) @jameslamb
Clean up misc, unneeded pylibcudf.libcudf in cudf._lib (#17309) @mroeschke
Exclude nanoarrow and flatbuffers from installation (#17308) @vyasr
Update CI jobs to include Polars in nightlies and improve IWYU (#17306) @vyasr
Move strings repeat benchmarks to nvbench (#17304) @davidwendt
Fix synchronization bug in bool parquet mukernels (#17302) @pmattione-nvidia
Move strings replace benchmarks to nvbench (#17301) @davidwendt
Support polars 1.13 (#17299) @wence-
Replace FindcuFile with upstream FindCUDAToolkit support (#17298) @KyleFromNVIDIA
Expose stream-ordering in public transpose API (#17294) @shrshi
Replace workaround of JNI build with CUDF_KVIKIO_REMOTE_IO=OFF (#17293) @pxLi
cmake option: CUDF_KVIKIO_REMOTE_IO (#17291) @madsbk
Use more pylibcudf Python enums in cudf._lib (#17288) @mroeschke
Use pylibcudf enums in cudf Python quantile (#17287) @mroeschke
enforce wheel size limits, README formatting in CI (#17284) @jameslamb
Use numba-cuda<0.0.18 (#17280) @gmarkall
Add compute_column_expression to pylibcudf for transform.compute_column (#17279) @mroeschke
Optimize distinct inner join to use set find instead of retrieve (#17278) @PointKernel
remove WheelHelpers.cmake (#17276) @jameslamb
Plumb pylibcudf datetime APIs through cudf python (#17275) @Matt711
Follow up making Python tests more deterministic (#17272) @mroeschke
Use pylibcudf.search APIs in cudf python (#17271) @Matt711
Use pylibcudf.strings.convert.convert_integers.is_integer in cudf python (#17270) @Matt711
Move strings filter benchmarks to nvbench (#17269) @davidwendt
Make constructor of DeviceMemoryBufferView public (#17265) @liurenjie1024
Put a ceiling on cuda-python (#17264) @jameslamb
Always prefer device_reads and device_writes when kvikIO is enabled (#17260) @vuule
Expose streams in public quantile APIs (#17257) @shrshi
Add support for pyarrow-18 (#17256) @galipremsagar
Move strings/numeric convert benchmarks to nvbench (#17255) @davidwendt
Add new dask_cudf.read_parquet API (#17250) @rjzamora
Add read_parquet_metadata to pylibcudf (#17245) @mroeschke
Search for kvikio with lowercase (#17243) @vyasr
KvikIO shared library (#17239) @madsbk
Use more pylibcudf.io.types enums in cudf._lib (#17237) @mroeschke
Expose mixed and conditional joins in pylibcudf (#17235) @wence-
Add io.text APIs to pylibcudf (#17232) @mroeschke
Add num_iterations axis to the multi-threaded Parquet benchmarks (#17231) @vuule
Move strings to date/time types benchmarks to nvbench (#17229) @davidwendt
Support for polars 1.12 in cudf-polars (#17227) @wence-
Allow generating large strings in benchmarks (#17224) @davidwendt
Refactor gather/scatter benchmarks for strings (#17223) @davidwendt
Deprecate single component extraction methods in libcudf (#17221) @Matt711
Remove nvtext::load_vocabulary from pylibcudf (#17220) @Matt711
Benchmarking JSON reader for compressed inputs (#17219) @shrshi
Expose stream-ordering in partitioning API (#17213) @shrshi
Move strings::concatenate benchmark to nvbench (#17211) @davidwendt
Expose stream-ordering in subword tokenizer API (#17206) @shrshi
Refactor Dask cuDF legacy code (#17205) @rjzamora
Make HostMemoryBuffer call into the DefaultHostMemoryAllocator (#17204) @revans2
Unified binary_ops and ast benchmarks parameter names (#17200) @lamarrr
Add in new java API for raw host memory allocation (#17197) @revans2
Remove java reservation (#17189) @revans2
Fixed unused attribute compilation error for GCC 13 (#17188) @lamarrr
Change default KvikIO parameters in cuDF: set the thread pool size to 4, and compatibility mode to ON (#17185) @kingcrimsontianyu
Use make_device_uvector instead of cudaMemcpyAsync in inplace_bitmask_binop (#17181) @davidwendt
Make ai.rapids.cudf.HostMemoryBuffer#copyFromStream public. (#17179) @liurenjie1024
Separate evaluation logic from IR objects in cudf-polars (#17175) @rjzamora
Move nvtext ngrams benchmarks to nvbench (#17173) @davidwendt
Remove includes suggested by include-what-you-use (#17170) @vyasr
Reading multi-source compressed JSONL files (#17161) @shrshi
Process parquet bools with microkernels (#17157) @pmattione-nvidia
Upgrade to polars 1.11 in cudf-polars (#17154) @wence-
Deprecate current libcudf nvtext minhash functions (#17152) @davidwendt
Remove unused variable in internal merge_tdigests utility (#17151) @davidwendt
Use the full ref name of rmm.DeviceBuffer in the sphinx config file (#17150) @Matt711
Move segmented_gather function from the copying module to the lists module (#17148) @Matt711
Use async execution policy for true_if (#17146) @PointKernel
Add conversion from cudf-polars expressions to libcudf ast for parquet filters (#17141) @wence-
devcontainer: replace VAULT_HOST with AWS_ROLE_ARN (#17134) @jjacobelli
Replace direct cudaMemcpyAsync calls with utility functions (limited to cudf::io) (#17132) @vuule
use rapids-generate-pip-constraints to pin to oldest dependencies in CI (#17131) @jameslamb
Set the default number of threads in KvikIO thread pool to 8 (#17126) @kingcrimsontianyu
Fix clang-tidy violations for span.hpp and hostdevice_vector.hpp (#17124) @davidwendt
Disable the Parquet reader's wide lists tables GTest by default (#17120) @mhaseeb123
Add compile time check to ensure the counting_iterator type in counting_transform_iterator fits in size_type (#17118) @mhaseeb123
Minor I/O code quality improvements (#17105) @kingcrimsontianyu
Remove the additional host register calls initially intended for performance improvement on Grace Hopper (#17092) @kingcrimsontianyu
Split hash-based groupby into multiple smaller files to reduce build time (#17089) @PointKernel
build wheels without build isolation (#17088) @jameslamb
Polars: DataFrame Serialization (#17062) @madsbk
Remove unused hash helper functions (#17056) @PointKernel
Add to_dlpack/from_dlpack APIs to pylibcudf (#17055) @mroeschke
Move flatten_single_pass_aggs to its own TU (#17053) @PointKernel
Replace deprecated cuco APIs with updated versions (#17052) @PointKernel
Refactor ORC dictionary encoding to migrate to the new cuco::static_map (#17049) @mhaseeb123
Move pylibcudf/libcudf/wrappers/decimals to pylibcudf/libcudf/fixed_point (#17048) @mroeschke
make conda installs in CI stricter (part 2) (#17042) @jameslamb
Use managed memory for NDSH benchmarks (#17039) @karthikeyann
Clean up hash-groupby var_hash_functor (#17034) @PointKernel
Add json APIs to pylibcudf (#17025) @mroeschke
Add string.replace_re APIs to pylibcudf (#17023) @mroeschke
Replace old host tree algorithm with new algorithm in JSON reader (#17019) @karthikeyann
Unify treatment of Expr and IR nodes in cudf-polars DSL (#17016) @wence-
make conda installs in CI stricter (#17013) @jameslamb
Pylibcudf: pack and unpack (#17012) @madsbk
Remove unneeded pylibcudf.libcudf.wrappers.duration usage in cudf (#17010) @mroeschke
Add custom "fused" groupby aggregation to Dask cuDF (#17009) @rjzamora
Make tests more deterministic (#17008) @galipremsagar
Remove unused import (#17005) @Matt711
Add string.convert.convert_urls APIs to pylibcudf (#17003) @mroeschke
Add release tracking to project automation scripts (#17001) @jarmak-nv
Implement inequality joins by translation to conditional joins (#17000) @wence-
Add string.convert.convert_lists APIs to pylibcudf (#16997) @mroeschke
Performance optimization of JSON validation (#16996) @karthikeyann
Add string.convert.convert_ipv4 APIs to pylibcudf (#16994) @mroeschke
Add string.convert.convert_integers APIs to pylibcudf (#16991) @mroeschke
Add string.convert_floats APIs to pylibcudf (#16990) @mroeschke
Add string.convert.convert_fixed_type APIs to pylibcudf (#16984) @mroeschke
Remove unnecessary std::move's in pylibcudf (#16983) @Matt711
Add docstrings and test for strings.convert_durations APIs for pylibcudf (#16982) @mroeschke
JSON tokenizer memory optimizations (#16978) @shrshi
Turn on xfail_strict = true for all python packages (#16977) @wence-
Add string.convert.convert_datetime/convert_booleans APIs to pylibcudf (#16971) @mroeschke
Auto assign PR to author (#16969) @Matt711
Deprecate support for directly accessing logger (#16964) @vyasr
Expunge NamedColumn (#16962) @wence-
Add clang-tidy to CI (#16958) @vyasr
Address all remaining clang-tidy errors (#16956) @vyasr
Apply clang-tidy autofixes (#16949) @vyasr
Use nvcomp wheel instead of bundling nvcomp (#16946) @KyleFromNVIDIA
Refactor the cuda_memcpy functions to make them more usable (#16945) @vuule
Add string.split APIs to pylibcudf (#16940) @mroeschke
clang-tidy fixes part 3 (#16939) @vyasr
clang-tidy fixes part 2 (#16938) @vyasr
clang-tidy fixes part 1 (#16937) @vyasr
Add string.wrap APIs to pylibcudf (#16935) @mroeschke
Add string.translate APIs to pylibcudf (#16934) @mroeschke
Add string.find_multiple APIs to pylibcudf (#16920) @mroeschke
Batch memcpy the last offsets for output buffers of str and list cols in PQ reader (#16905) @mhaseeb123
reduce wheel build verbosity, narrow deprecation warning filter (#16896) @jameslamb
Improve aggregation device functors (#16884) @PointKernel
Upgrade pandas pinnings to support 2.2.3 (#16882) @galipremsagar
Fix 24.10 to 24.12 forward merge (#16876) @bdice
Manually resolve conflicts between branch-24.12 and branch-24.10 (#16871) @galipremsagar
Add in support for setting delim when parsing JSON through java (#16867) @revans2
Reapply mixed_semi_join refactoring and bug fixes (#16859) @mhaseeb123
Add string padding and side_type APIs to pylibcudf (#16833) @mroeschke
Organize parquet reader mukernel non-nullable code, introduce manual block scans (#16830) @pmattione-nvidia
Remove superfluous use of std::vector for std::future (#16829) @kingcrimsontianyu
Rework read_csv IO to avoid reading whole input with a single host_read (#16826) @vuule
Add strings.combine APIs to pylibcudf (#16790) @mroeschke
Add remaining string.char_types APIs to pylibcudf (#16788) @mroeschke
Add new nvtext minhash_permuted API (#16756) @davidwendt
Avoid public constructors when called with columns to avoid unnecessary validation (#16747) @mroeschke
Use changed-files shared workflow (#16713) @KyleFromNVIDIA
lint: replace isort with Ruff's rule I (#16685) @Borda
Improve the performance of low cardinality groupby (#16619) @PointKernel
Parquet reader list microkernel (#16538) @pmattione-nvidia
AWS S3 IO through KvikIO (#16499) @madsbk
Refactor histogram reduction using cuco::static_set::insert_and_find (#16485) @srinivasyadav18
Use numba-cuda>=0.0.13 (#16474) @gmarkall
