Releases: rapidsai/cudf
v22.10.00
🚨 Breaking Changes
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Remove `kwargs` in `read_csv` & `to_csv` (#11762) @galipremsagar
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Upgrade `pandas` to `1.5` (#11617) @galipremsagar
- Change default value of `ordered` to `False` in `CategoricalDtype` (#11604) @galipremsagar
- Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Deprecate `skiprows` and `num_rows` in `read_orc` (#11522) @galipremsagar
- Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
- Drop support for `skiprows` and `num_rows` in `cudf.read_parquet` (#11480) @galipremsagar
- Disable Arrow S3 support by default. (#11470) @bdice
- Convert thrust::optional usages to std::optional (#11455) @robertmaynard
- Remove unused is_struct trait. (#11450) @bdice
- Refactor the `Buffer` class (#11447) @madsbk
- Return empty dataframe when reading an ORC file using empty `columns` option (#11446) @vuule
- Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
- Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
- Use the new JSON parser when the experimental reader is selected (#11364) @vuule
- Remove deprecated Series.applymap. (#11031) @bdice
- Remove deprecated expand parameter from str.findall. (#11030) @bdice
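The zfill entry above (#11634) names Python's output as the target behavior but does not describe it. As a minimal sketch of that reference behavior only, the following uses Python's built-in `str.zfill`; it does not call cuDF, and the sample strings are illustrative assumptions rather than cases taken from the pull request.

```python
# Python's built-in str.zfill: the reference behavior named in the entry.
# The sample inputs below are illustrative only.
samples = ["42", "-42", "+42", "abc", "12345678"]

# Pad each string with zeros to a total width of 6.
# A leading sign stays in front of the inserted zeros, and strings that
# are already at least 6 characters long are returned unchanged.
padded = [s.zfill(6) for s in samples]

for original, result in zip(samples, padded):
    print(f"{original!r:>12} -> {result!r}")
```

Code that relied on the earlier cuDF output for signed strings may need to be checked against this behavior after upgrading.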
🐛 Bug Fixes
- Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
- Handle `ptx` file paths during `strings_udf` import (#11862) @galipremsagar
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Reset `strings_udf` CEC and solve several related issues (#11846) @brandon-b-miller
- Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
- Fix `is_valid` checks in `Scalar._binaryop` (#11818) @wence-
- Fix operator `NotImplemented` issue with `numpy` (#11816) @galipremsagar
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Build `strings_udf` package with other python packages in nightlies (#11808) @brandon-b-miller
- Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
- Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
- Build `cudf` locally before building `strings_udf` conda packages in CI (#11785) @brandon-b-miller
- Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Fix issue with set-item in case of `list` and `struct` types (#11760) @galipremsagar
- Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
- Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
- Fix ORC string sum statistics (#11740) @vuule
- Add `strings_udf` package for python 3.9 (#11730) @brandon-b-miller
- Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
- Don't assume stream is a compile-time constant expression (#11725) @vyasr
- Fix get_thrust.cmake format at patch command (#11715) @davidwendt
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
- Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
- Fix `DataFrame.from_arrow` to preserve type metadata (#11698) @galipremsagar
- Fix compile error due to missing header (#11697) @ttnghia
- Default to Snappy compression in `to_orc` when using cuDF or Dask (#11690) @vuule
- Fix an issue related to `MultiIndex` when `group_keys=True` (#11689) @galipremsagar
- Transfer correct dtype to exploded column (#11687) @wence-
- Ignore protobuf generated files in `mypy` checks (#11685) @galipremsagar
- Maintain the index name after `.loc` (#11677) @shwina
- Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
- Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
- Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
- Fix multi-file remote datasource bug (#11655) @rjzamora
- Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
- Fix bug in `device_write()`: it uses an incorrect size (#11651) @madsbk
- fixes overflows in benchmarks (#11649) @elstehle
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
- Fix host scalars construction of nested types (#11612) @galipremsagar
- Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
- Change default value of `ordered` to `False` in `CategoricalDtype` (#11604) @galipremsagar
- Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
- Add is_timestamp test for leap second (60) (#11594) @davidwendt
- Fix an issue with `to_arrow` when column name type is not a string (#11590) @galipremsagar
- Fix exception in segmented-reduce benchmark (#11588) @davidwendt
- Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
- Correct distribution data type in `quantiles` benchmark (#11584) @vuule
- Fix multibyte_split benchmark for host buffers (#11583) @upsj
- xfail custreamz display test for now (#11567) @shwina
- Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
- Reduce code duplication for `dask` & `distributed` nightly/stable installs (#11565) @galipremsagar
- Fix groupby failures in dask_cudf CI (#11561) @rjzamora
- Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
- find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
- Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
- Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
- Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
- Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
- Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
- Update parquet fuzz tests to drop support for `skiprows` & `num_rows` (#11505) @galipremsagar
- Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
- Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
- Return empty dataframe when reading an ORC file using empty `columns` option (#11446) @vuule
- libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
- Fix regex quantifier check to include capture groups (#11373) @davidwendt
- Fix read_text when byte_range is aligned with field (#11371) @upsj
- Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
- column: calculate null_count before release()ing the cudf::column (#11365) @wence-
📖 Documentation
- Update `guide-to-udfs` notebook (#11861) @brandon-b-miller
- Update docstring for cudf.read_text (#11799) @GregoryKimball
- Add doc section for `list` & `struct` handling (#11770) @galipremsagar
- Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
- Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
- Add docs for use of string data to `DataFrame.apply` and `Series.apply` and update guide to UDFs notebook (#11733) @brandon-b-miller
- Enable more Pydocstyle rules (#11582) @bdice
- Remove unused cpp/img folder (#11554) @davidwendt
- Publish C++ developer docs (#11475) @vyasr
- Fix a misalignment in `cudf.get_dummies` docstring (#11443) @galipremsagar
- Update contributing doc to include links to the developer guides (#11390) @davidwendt
- Fix table_view_base doxygen format (#11340) @davidwendt
- Create main developer guide for Python (#11235) @vyasr
- Add developer documentation for benchmarking (#11122) @vyasr
- cuDF error handling document (#7917) @isVoid
🚀 New Features
- Add hasNull statistic reading ability to ORC (#11747) @devavret
- Add `istitle` to string UDFs (#11738) @brandon-b-miller
- JSON Column creation in GPU (#11714) @karthikeyann
- Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
- Add BGZIP `data_chunk_reader` (#11652) @upsj
- Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
- changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
- Generate unique keys table in java JNI `contiguousSplitGroups` (#11614) @res-life
- Generic type casting to support the new nested JSON reader (#11613) @elstehle
- JSON tree traversal (#11610) @karthikeyann
- Add casting operators to masked UDFs (#11578) @brandon-b-miller
- Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
- Add strings 'like' function (#11558) @davidwendt
- Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
- Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
- Adds support for json lines format to the nested JSON reader (#11534) @elstehle
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
- Add `gdb` pretty-printers for simple types (#11499) @upsj
- Add `create_random_column` function to the d...
v22.08.01
🚨 Breaking Changes
- Pin `numpy` to `<1.23` (#11824) @galipremsagar
- Remove legacy join APIs (#11274) @vyasr
- Remove `lists::drop_list_duplicates` (#11236) @ttnghia
- Remove Index.replace API (#11131) @vyasr
- Remove deprecated Index methods from Frame (#11073) @vyasr
- Remove public API of cudf.merge_sorted. (#11032) @bdice
- Drop python `3.7` in code-base (#11029) @galipremsagar
- Return empty dataframe when reading a Parquet file using empty `columns` option (#11018) @vuule
- Remove Arrow CUDA IPC code (#10995) @shwina
- Buffer: make `.ptr` read-only (#10872) @madsbk
🐛 Bug Fixes
- Fix out-of-bound access in `cudf::detail::label_segments` (#11497) @ttnghia
- Fix `distributed` error related to `loop_in_thread` (#11428) @galipremsagar
- Fix atomic operations on NaN values (#11420) @ttnghia
- Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
- Revert "Allow CuPy 11" (#11409) @jakirkham
- Fix `moto` timeouts (#11369) @galipremsagar
- Set `+/-infinity` as the `identity` values for floating-point numbers in device operators `min` and `max` (#11357) @ttnghia
- Fix memory_usage() for `ListSeries` (#11355) @thomcom
- Fix constructing Column from column_view with expired mask (#11354) @shwina
- Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
- Fix `DatetimeIndex` & `TimedeltaIndex` constructors (#11342) @galipremsagar
- Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
- Fix performance issue and add a new code path to `cudf::detail::contains` (#11330) @ttnghia
- Pin `pytorch` to temporarily unblock from `libcupti` errors (#11289) @galipremsagar
- Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
- Fix inconsistency when hashing two tables in `cudf::detail::contains` (#11284) @ttnghia
- Fix issue related to numpy array and `category` dtype (#11282) @galipremsagar
- Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
- Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
- Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
- Fix compile error due to missing header (#11257) @ttnghia
- Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
- Fix `tests/rolling/empty_input_test` (#11238) @ttnghia
- Fix const qualifier when using `host_span<bitmask_type const*>` (#11220) @ttnghia
- Avoid using `nvcompBatchedDeflateDecompressGetTempSizeEx` in cuIO (#11213) @vuule
- Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
- Fix cumulative count index behavior (#11188) @brandon-b-miller
- Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
- Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
- Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
- Ensure cuco export set is installed in cmake build (#11147) @jlowe
- Avoid redundant deepcopy in `cudf.from_pandas` (#11142) @galipremsagar
- Fix compile error due to missing header (#11126) @ttnghia
- Fix `__cuda_array_interface__` failures (#11113) @galipremsagar
- Support octal and hex within regex character class pattern (#11112) @davidwendt
- Fix split_re matching logic for word boundaries (#11106) @davidwendt
- Handle multiple files metadata in `read_parquet` (#11105) @galipremsagar
- Fix index alignment for Series objects with repeated index (#11103) @shwina
- FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
- Fix regex word boundary logic to include underline (#11099) @davidwendt
- Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
- Fix duplicate `cudatoolkit` pinning issue (#11070) @galipremsagar
- Maintain the input index in the result of a groupby-transform (#11068) @shwina
- Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
- Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
- Include missing header for usage of `get_current_device_resource()` (#11047) @AtlantaPepsi
- Fix warn_unused_result error in parquet test (#11026) @karthikeyann
- Return empty dataframe when reading a Parquet file using empty `columns` option (#11018) @vuule
- Fix small error in page row count limiting (#10991) @etseidl
- Fix a row index entry error in ORC writer issue (#10989) @vuule
- Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice
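The #11357 entry in the list above concerns identity values for `min` and `max` on floating-point numbers. The sketch below illustrates the general idea in plain Python, not libcudf code: a reduction's starting value must leave every possible input unchanged, which for `min` is `+inf` and for `max` is `-inf`. The helper names `reduce_min` and `reduce_max` are hypothetical, and the entry does not say which values were used before the fix; the largest finite float appears here only as an assumed counterexample.

```python
import math
import sys
from functools import reduce

# Identity elements: combining them with any float returns that float.
MIN_IDENTITY = math.inf    # min(+inf, x) == x for every float x
MAX_IDENTITY = -math.inf   # max(-inf, x) == x for every float x


def reduce_min(values):
    """Fold values with min, starting from the identity +inf."""
    return reduce(min, values, MIN_IDENTITY)


def reduce_max(values):
    """Fold values with max, starting from the identity -inf."""
    return reduce(max, values, MAX_IDENTITY)


# A finite starting value is not an identity: it changes the result
# when the data itself contains infinity.
wrong_min = reduce(min, [math.inf], sys.float_info.max)

print(reduce_min([3.0, -1.5, 2.0]))   # smallest element
print(reduce_min([math.inf]))         # stays +inf
print(wrong_min)                      # largest finite float, not +inf
```

With the infinite identities, an empty reduction also returns the identity itself, which is the usual convention for a fold.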
📖 Documentation
- Defer loading of `custom.js` (#11465) @galipremsagar
- Fix issues with day & night modes in python docs (#11400) @galipremsagar
- Update missing data handling APIs in docs (#11345) @galipremsagar
- Add lists filtering APIs to doxygen group. (#11336) @bdice
- Remove unused import in README sample (#11318) @vyasr
- Note null behavior in `where` docs (#11276) @brandon-b-miller
- Update docstring for spans in `get_row_data_range` (#11271) @vyasr
- Update nvCOMP integration table (#11231) @vuule
- Add dev docs for documentation writing (#11217) @vyasr
- Documentation fix for concatenate (#11187) @dagardner-nv
- Fix unresolved links in markdown (#11173) @karthikeyann
- Fix cudf version in README.md install commands (#11164) @jvanstraten
- Switch `language` from `None` to `"en"` in docs build (#11133) @galipremsagar
- Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
- Add docstring entry for `DataFrame.value_counts` (#11039) @galipremsagar
- Add docs to rolling var, std, count. (#11035) @bdice
- Fix docs for Numba UDFs. (#11020) @bdice
- Replace column comparison utilities functions with macros (#11007) @karthikeyann
- Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
- Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
- Fix Doxygen warnings in table header files (#10964) @karthikeyann
- Fix Doxygen warnings in column header files (#10963) @karthikeyann
- Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
- Generate Doxygen Tag File for Libcudf (#10932) @isVoid
- Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
- Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
- Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
- fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
- fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
- Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
- Add missing documentation in aggregation.hpp (#10887) @karthikeyann
- Revise PR template. (#10774) @bdice
🚀 New Features
- Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
- Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
- Adding byte array view structure (#11322) @hyperbolic2346
- Adding byte_array statistics (#11303) @hyperbolic2346
- Add column indexes to Parquet writer (#11302) @etseidl
- Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
- FST benchmark (#11243) @karthikeyann
- Adds the Finite-State Transducer algorithm (#11242) @elstehle
- Refactor `collect_set` to use `cudf::distinct` and `cudf::lists::distinct` (#11228) @ttnghia
- Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
- Add 24 bit dictionary support to Parquet writer (#11216) @devavret
- Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
- JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
- Add JNI bindings for extractAllRecord (#11196) @anthony-chang
- Add `cudf.options` (#11193) @isVoid
- Add thrift support for parquet column and offset indexes (#11178) @etseidl
- Adding binary read/write as options for parquet (#11160) @hyperbolic2346
- Support `nth_element` for window functions (#11158) @mythrocks
- Implement `lists::distinct` and `cudf::detail::stable_distinct` (#11149) @ttnghia
- Implement Groupby pct_change (#11144) @skirui-source
- Add JNI for set operations (#11143) @ttnghia
- Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
- Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
- Feature/python benchmarking (#11125) @vyasr
- Support `nan_equality` in `cudf::distinct` (#11118) @ttnghia
- Added JNI for getMapValueForKeys (#11104) @razajafri
- Refactor `semi_anti_join` (#11100) @ttnghia
- Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
- Adds the Logical Stack algorithm (#11078) @elstehle
- Add doxygen-check pre-commit hook (#11076) @karthikeyann
- Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
- Add Doxygen CI check (#11057) @karthikeyann
- Support `duplicate_keep_option` in `cudf::distinct` (#11052) @ttnghia
- Support set operations (#11043) @ttnghia
- Support for ZLIB compression in ORC writer (#11036) @vuule
- Adding feature swaplevels (#11027) @VamsiTallam95
- Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
- Function for bfill, ffill #9591 (#11022) @Sreekiran096
- Generate group offsets from element labels (#11017) @ttnghia
- Feature axes (#10979) @VamsiTallam95
- Generate group labels from offsets (#10945) @ttnghia
- Add missing cuIO benchmark coverage for duration types (#10933) @vuule
- Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
- Reindex Improvements (#10815) @brandon-b-miller
- Implement value_counts for DataFrame (#10813) @martinfalisse
🛠️ Improvements
- Pin `numpy` to `<1.23` (#11824) @galipremsagar
- Make Index Join Tests on Default Precisions Deterministic (#11451) @isVoid
- Pin `dask` & `distributed` for release (#11433) @galipremsagar...
v22.08.00
🚨 Breaking Changes
- Remove legacy join APIs (#11274) @vyasr
- Remove `lists::drop_list_duplicates` (#11236) @ttnghia
- Remove Index.replace API (#11131) @vyasr
- Remove deprecated Index methods from Frame (#11073) @vyasr
- Remove public API of cudf.merge_sorted. (#11032) @bdice
- Drop python `3.7` in code-base (#11029) @galipremsagar
- Return empty dataframe when reading a Parquet file using empty `columns` option (#11018) @vuule
- Remove Arrow CUDA IPC code (#10995) @shwina
- Buffer: make `.ptr` read-only (#10872) @madsbk
🐛 Bug Fixes
- Fix `distributed` error related to `loop_in_thread` (#11428) @galipremsagar
- Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
- Revert "Allow CuPy 11" (#11409) @jakirkham
- Fix `moto` timeouts (#11369) @galipremsagar
- Set `+/-infinity` as the `identity` values for floating-point numbers in device operators `min` and `max` (#11357) @ttnghia
- Fix memory_usage() for `ListSeries` (#11355) @thomcom
- Fix constructing Column from column_view with expired mask (#11354) @shwina
- Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
- Fix `DatetimeIndex` & `TimedeltaIndex` constructors (#11342) @galipremsagar
- Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
- Fix performance issue and add a new code path to `cudf::detail::contains` (#11330) @ttnghia
- Pin `pytorch` to temporarily unblock from `libcupti` errors (#11289) @galipremsagar
- Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
- Fix inconsistency when hashing two tables in `cudf::detail::contains` (#11284) @ttnghia
- Fix issue related to numpy array and `category` dtype (#11282) @galipremsagar
- Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
- Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
- Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
- Fix compile error due to missing header (#11257) @ttnghia
- Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
- Fix `tests/rolling/empty_input_test` (#11238) @ttnghia
- Fix const qualifier when using `host_span<bitmask_type const*>` (#11220) @ttnghia
- Avoid using `nvcompBatchedDeflateDecompressGetTempSizeEx` in cuIO (#11213) @vuule
- Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
- Fix cumulative count index behavior (#11188) @brandon-b-miller
- Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
- Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
- Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
- Ensure cuco export set is installed in cmake build (#11147) @jlowe
- Avoid redundant deepcopy in `cudf.from_pandas` (#11142) @galipremsagar
- Fix compile error due to missing header (#11126) @ttnghia
- Fix `__cuda_array_interface__` failures (#11113) @galipremsagar
- Support octal and hex within regex character class pattern (#11112) @davidwendt
- Fix split_re matching logic for word boundaries (#11106) @davidwendt
- Handle multiple files metadata in `read_parquet` (#11105) @galipremsagar
- Fix index alignment for Series objects with repeated index (#11103) @shwina
- FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
- Fix regex word boundary logic to include underline (#11099) @davidwendt
- Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
- Fix duplicate `cudatoolkit` pinning issue (#11070) @galipremsagar
- Maintain the input index in the result of a groupby-transform (#11068) @shwina
- Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
- Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
- Include missing header for usage of `get_current_device_resource()` (#11047) @AtlantaPepsi
- Fix warn_unused_result error in parquet test (#11026) @karthikeyann
- Return empty dataframe when reading a Parquet file using empty `columns` option (#11018) @vuule
- Fix small error in page row count limiting (#10991) @etseidl
- Fix a row index entry error in ORC writer issue (#10989) @vuule
- Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice
📖 Documentation
- Fix issues with day & night modes in python docs (#11400) @galipremsagar
- Update missing data handling APIs in docs (#11345) @galipremsagar
- Add lists filtering APIs to doxygen group. (#11336) @bdice
- Remove unused import in README sample (#11318) @vyasr
- Note null behavior in `where` docs (#11276) @brandon-b-miller
- Update docstring for spans in `get_row_data_range` (#11271) @vyasr
- Update nvCOMP integration table (#11231) @vuule
- Add dev docs for documentation writing (#11217) @vyasr
- Documentation fix for concatenate (#11187) @dagardner-nv
- Fix unresolved links in markdown (#11173) @karthikeyann
- Fix cudf version in README.md install commands (#11164) @jvanstraten
- Switch `language` from `None` to `"en"` in docs build (#11133) @galipremsagar
- Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
- Add docstring entry for `DataFrame.value_counts` (#11039) @galipremsagar
- Add docs to rolling var, std, count. (#11035) @bdice
- Fix docs for Numba UDFs. (#11020) @bdice
- Replace column comparison utilities functions with macros (#11007) @karthikeyann
- Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
- Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
- Fix Doxygen warnings in table header files (#10964) @karthikeyann
- Fix Doxygen warnings in column header files (#10963) @karthikeyann
- Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
- Generate Doxygen Tag File for Libcudf (#10932) @isVoid
- Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
- Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
- Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
- fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
- fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
- Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
- Add missing documentation in aggregation.hpp (#10887) @karthikeyann
- Revise PR template. (#10774) @bdice
🚀 New Features
- Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
- Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
- Adding byte array view structure (#11322) @hyperbolic2346
- Adding byte_array statistics (#11303) @hyperbolic2346
- Add column indexes to Parquet writer (#11302) @etseidl
- Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
- FST benchmark (#11243) @karthikeyann
- Adds the Finite-State Transducer algorithm (#11242) @elstehle
- Refactor `collect_set` to use `cudf::distinct` and `cudf::lists::distinct` (#11228) @ttnghia
- Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
- Add 24 bit dictionary support to Parquet writer (#11216) @devavret
- Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
- JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
- Add JNI bindings for extractAllRecord (#11196) @anthony-chang
- Add `cudf.options` (#11193) @isVoid
- Add thrift support for parquet column and offset indexes (#11178) @etseidl
- Adding binary read/write as options for parquet (#11160) @hyperbolic2346
- Support `nth_element` for window functions (#11158) @mythrocks
- Implement `lists::distinct` and `cudf::detail::stable_distinct` (#11149) @ttnghia
- Implement Groupby pct_change (#11144) @skirui-source
- Add JNI for set operations (#11143) @ttnghia
- Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
- Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
- Feature/python benchmarking (#11125) @vyasr
- Support `nan_equality` in `cudf::distinct` (#11118) @ttnghia
- Added JNI for getMapValueForKeys (#11104) @razajafri
- Refactor `semi_anti_join` (#11100) @ttnghia
- Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
- Adds the Logical Stack algorithm (#11078) @elstehle
- Add doxygen-check pre-commit hook (#11076) @karthikeyann
- Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
- Add Doxygen CI check (#11057) @karthikeyann
- Support `duplicate_keep_option` in `cudf::distinct` (#11052) @ttnghia
- Support set operations (#11043) @ttnghia
- Support for ZLIB compression in ORC writer (#11036) @vuule
- Adding feature swaplevels (#11027) @VamsiTallam95
- Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
- Function for bfill, ffill #9591 (#11022) @Sreekiran096
- Generate group offsets from element labels (#11017) @ttnghia
- Feature axes (#10979) @VamsiTallam95
- Generate group labels from offsets (#10945) @ttnghia
- Add missing cuIO benchmark coverage for duration types (#10933) @vuule
- Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
- Reindex Improvements (#10815) @brandon-b-miller
- Implement value_counts for DataFrame (#10813) @martinfalisse
🛠️ Improvements
- Pin `dask` & `distributed` for release (#11433) @galipremsagar
- Use documented header template for `doxygen` (#11430) @galipremsagar
- Relax arrow version in dev env (#11418) @galipremsagar
- Allow CuPy 11 (#11393) @jakirkham
- Improve multibyte_split performance (#11347) @cwharris
- Switch death test to use explicit trap. (#11326) @vyasr
- Add --output-on-failure to ctest args. (#11321) @vyasr
- Consolidate remaining DataFrame...
[NIGHTLY] v22.10.00
🚨 Breaking Changes
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Remove `kwargs` in `read_csv` & `to_csv` (#11762) @galipremsagar
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Upgrade `pandas` to `1.5` (#11617) @galipremsagar
- Change default value of `ordered` to `False` in `CategoricalDtype` (#11604) @galipremsagar
- Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Deprecate `skiprows` and `num_rows` in `read_orc` (#11522) @galipremsagar
- Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
- Drop support for `skiprows` and `num_rows` in `cudf.read_parquet` (#11480) @galipremsagar
- Disable Arrow S3 support by default. (#11470) @bdice
- Convert thrust::optional usages to std::optional (#11455) @robertmaynard
- Remove unused is_struct trait. (#11450) @bdice
- Refactor the `Buffer` class (#11447) @madsbk
- Return empty dataframe when reading an ORC file using empty `columns` option (#11446) @vuule
- Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
- Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
- Use the new JSON parser when the experimental reader is selected (#11364) @vuule
- Remove deprecated Series.applymap. (#11031) @bdice
- Remove deprecated expand parameter from str.findall. (#11030) @bdice
🐛 Bug Fixes
- Force using old fmt in nvbench. (#12064) @vyasr
- Update cuda-python dependency to 11.7.1 (#11994) @shwina
- Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
- Handle `ptx` file paths during `strings_udf` import (#11862) @galipremsagar
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Reset `strings_udf` CEC and solve several related issues (#11846) @brandon-b-miller
- Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
- Fix `is_valid` checks in `Scalar._binaryop` (#11818) @wence-
- Fix operator `NotImplemented` issue with `numpy` (#11816) @galipremsagar
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Build `strings_udf` package with other python packages in nightlies (#11808) @brandon-b-miller
- Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
- Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
- Build `cudf` locally before building `strings_udf` conda packages in CI (#11785) @brandon-b-miller
- Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Fix issue with set-item in case of `list` and `struct` types (#11760) @galipremsagar
- Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
- Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
- Fix ORC string sum statistics (#11740) @vuule
- Add `strings_udf` package for python 3.9 (#11730) @brandon-b-miller
- Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
- Don't assume stream is a compile-time constant expression (#11725) @vyasr
- Fix get_thrust.cmake format at patch command (#11715) @davidwendt
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
- Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
- Fix `DataFrame.from_arrow` to preserve type metadata (#11698) @galipremsagar
- Fix compile error due to missing header (#11697) @ttnghia
- Default to Snappy compression in `to_orc` when using cuDF or Dask (#11690) @vuule
- Fix an issue related to `MultiIndex` when `group_keys=True` (#11689) @galipremsagar
- Transfer correct dtype to exploded column (#11687) @wence-
- Ignore protobuf generated files in `mypy` checks (#11685) @galipremsagar
- Maintain the index name after `.loc` (#11677) @shwina
- Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
- Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
- Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
- Fix multi-file remote datasource bug (#11655) @rjzamora
- Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
- Fix bug in `device_write()`: it uses an incorrect size (#11651) @madsbk
- fixes overflows in benchmarks (#11649) @elstehle
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
- Fix host scalars construction of nested types (#11612) @galipremsagar
- Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
- Change default value of `ordered` to `False` in `CategoricalDtype` (#11604) @galipremsagar
- Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
- Add is_timestamp test for leap second (60) (#11594) @davidwendt
- Fix an issue with `to_arrow` when column name type is not a string (#11590) @galipremsagar
- Fix exception in segmented-reduce benchmark (#11588) @davidwendt
- Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
- Correct distribution data type in `quantiles` benchmark (#11584) @vuule
- Fix multibyte_split benchmark for host buffers (#11583) @upsj
- xfail custreamz display test for now (#11567) @shwina
- Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
- Reduce code duplication for `dask` & `distributed` nightly/stable installs (#11565) @galipremsagar
- Fix groupby failures in dask_cudf CI (#11561) @rjzamora
- Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
- find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
- Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
- Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
- Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
- Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
- Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
- Update parquet fuzz tests to drop support for `skiprows` & `num_rows` (#11505) @galipremsagar
- Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
- Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
- Return empty dataframe when reading an ORC file using empty `columns` option (#11446) @vuule
- libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
- Fix regex quantifier check to include capture groups (#11373) @davidwendt
- Fix read_text when byte_range is aligned with field (#11371) @upsj
- Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
- column: calculate null_count before release()ing the cudf::column (#11365) @wence-
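For context on the zfill entry above (#11634): the changelog does not describe the cuDF behavior in detail, so the sketch below only shows what Python's built-in `str.zfill` returns, on the assumption that this is the "Python output" the fix targets. It uses the standard library only; `samples` and `padded` are illustrative names, not cuDF identifiers.

```python
# Reference behavior of Python's built-in str.zfill (standard library only).
# Assumption: this is the "Python output" that the zfill entry refers to.
samples = ["12", "-42", "+7", "abc", "12345"]

# Left-pad each string with zeros to a total width of 5.
padded = [s.zfill(5) for s in samples]

for original, result in zip(samples, padded):
    print(f"{original!r} -> {result!r}")
```

In plain Python, a leading sign stays in front of the inserted zeros, and a string already at the requested width is returned unchanged.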
📖 Documentation
- Update `guide-to-udfs` notebook (#11861) @brandon-b-miller
- Update docstring for cudf.read_text (#11799) @GregoryKimball
- Add doc section for `list` & `struct` handling (#11770) @galipremsagar
- Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
- Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
- Add docs for use of string data to `DataFrame.apply` and `Series.apply` and update guide to UDFs notebook (#11733) @brandon-b-miller
- Enable more Pydocstyle rules (#11582) @bdice
- Remove unused cpp/img folder (#11554) @davidwendt
- Publish C++ developer docs (#11475) @vyasr
- Fix a misalignment in `cudf.get_dummies` docstring (#11443) @galipremsagar
- Update contributing doc to include links to the developer guides (#11390) @davidwendt
- Fix table_view_base doxygen format (#11340) @davidwendt
- Create main developer guide for Python (#11235) @vyasr
- Add developer documentation for benchmarking (#11122) @vyasr
- cuDF error handling document (#7917) @isVoid
🚀 New Features
- Add hasNull statistic reading ability to ORC (#11747) @devavret
- Add `istitle` to string UDFs (#11738) @brandon-b-miller
- JSON Column creation in GPU (#11714) @karthikeyann
- Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
- Add BGZIP `data_chunk_reader` (#11652) @upsj
- Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
- changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
- Generate unique keys table in java JNI `contiguousSplitGroups` (#11614) @res-life
- Generic type casting to support the new nested JSON reader (#11613) @elstehle
- JSON tree traversal (#11610) @karthikeyann
- Add casting operators to masked UDFs (#11578) @brandon-b-miller
- Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
- Add strings 'like' function (#11558) @davidwendt
- Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
- Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
- Adds support for json lines format to the ne...
v22.06.01
v22.06.00
🚨 Breaking Changes
- Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
- Rename `sliced_child` to `get_sliced_child`. (#10885) @bdice
- Add parameters to control page size in Parquet writer (#10882) @etseidl
- Make cudf::test::expect_columns_equal() to fail when comparing unsanitary lists. (#10880) @nvdbaranec
- Cleanup regex compiler fixed quantifiers source (#10843) @davidwendt
- Refactor `cudf::contains`, renaming and switching parameters role (#10802) @ttnghia
- Generic serialization of all column types (#10784) @wence-
- Return per-file metadata from readers (#10782) @vuule
- HostColumnVectorCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
- Update `groupby::hash` to use new row operators for keys (#10770) @PointKernel
- update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
- Rename CUDA_TRY macro to CUDF_CUDA_TRY, rename CHECK_CUDA macro to CUDF_CHECK_CUDA. (#10589) @bdice
- Upgrade `cudf` to support `pandas` 1.4.x versions (#10584) @galipremsagar
- Move binop methods from Frame to IndexedFrame and standardize the docstring (#10576) @vyasr
- Add default= kwarg to .list.get() accessor method (#10547) @shwina
- Remove deprecated `decimal_cols_as_float` in the ORC reader (#10515) @vuule
- Support nvComp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
- Fix findall_record to return empty list for no matches (#10491) @davidwendt
- Namespace/Docstring Fixes for Reduction (#10471) @isVoid
- Additional refactoring of hash functions (#10462) @bdice
- Fix default value of str.split expand parameter. (#10457) @bdice
- Remove deprecated code. (#10450) @vyasr
🐛 Bug Fixes
- Fix single column `MultiIndex` issue in `sort_index` (#10957) @galipremsagar
- Make SerializedTableHeader(numRows) public (#10949) @gerashegalov
- Fix `gcc_linux` version pinning in dev environment (#10943) @galipremsagar
- Fix an issue with reading raw string in `cudf.read_json` (#10924) @galipremsagar
- Make cudf::test::expect_columns_equal() to fail when comparing unsanitary lists. (#10880) @nvdbaranec
- Fix segmented_reduce on empty column with non-empty offsets (#10876) @davidwendt
- Fix dask-cudf groupby handling when grouping by all columns (#10866) @charlesbluca
- Fix a bug in `distinct`: using nested nulls logic (#10848) @PointKernel
- Fix constness / references in weak ordering operator() signatures. (#10846) @bdice
- Suppress sizeof-array-div warnings in thrust found by gcc-11 (#10840) @robertmaynard
- Add handling for string by-columns in dask-cudf groupby (#10830) @charlesbluca
- Fix compile warning in search.cu (#10827) @davidwendt
- Fix element access const correctness in `hostdevice_vector` (#10804) @vuule
- Update `cuco` git tag (#10788) @PointKernel
- HostColumnVectorCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
- Fixing deprecation warnings in test_orc.py (#10772) @hyperbolic2346
- Enable writing to `s3` storage in chunked parquet writer (#10769) @galipremsagar
- Fix construction of nested structs with EMPTY child (#10761) @shwina
- Fix replace error when regex has only zero match quantifiers (#10760) @davidwendt
- Fix an issue with one_level_list schemas in parquet reader. (#10750) @nvdbaranec
- update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
- Fix `cupy` function in notebook (#10737) @ajschmidt8
- Fix `fillna` to retain `columns` when it is `MultiIndex` (#10729) @galipremsagar
- Fix scatter for all-empty-string column case (#10724) @davidwendt
- Retain series name in `Series.apply` (#10716) @brandon-b-miller
- Correct build dir `cudf-config` dependency issues for static builds (#10704) @robertmaynard
- Fix list of testing requirements in setup.py. (#10678) @bdice
- Fix rounding to zero error in stod on very small float numbers (#10672) @davidwendt
- cuco isn't a cudf dependency when we are built shared (#10662) @robertmaynard
- Fix to_timestamps to support Z for %z format specifier (#10617) @davidwendt
- Verify compression type in Parquet reader (#10610) @vuule
- Fix struct row comparator's exception on empty structs (#10604) @sperlingxx
- Fix strings strip() to accept only str Scalar for to_strip parameter (#10597) @davidwendt
- Fix has_atomic_support check in can_use_hash_groupby() (#10588) @jbrennan333
- Revert Thrust 1.16 to Thrust 1.15 (#10586) @bdice
- Fix missing RMM_STATIC_CUDART define when compiling JNI with static CUDA runtime (#10585) @jlowe
- pin more cmake versions (#10570) @robertmaynard
- Re-enable Build Metrics Report (#10562) @davidwendt
- Remove statically linked CUDA runtime check in Java build (#10532) @jlowe
- Fix temp data cleanup in `test_text.py` (#10524) @brandon-b-miller
- Update pre-commit to run black 22.3.0 (#10523) @vyasr
- Remove deprecated `decimal_cols_as_float` in the ORC reader (#10515) @vuule
- Fix findall_record to return empty list for no matches (#10491) @davidwendt
- Allow users to specify data types for a subset of columns in `read_csv` (#10484) @vuule
- Fix default value of str.split expand parameter. (#10457) @bdice
- Improve coverage of dask-cudf's groupby aggregation, add tests for `dropna` support (#10449) @charlesbluca
- Allow string aggs for `dask_cudf.CudfDataFrameGroupBy.aggregate` (#10222) @charlesbluca
- In-place updates with loc or iloc don't work correctly when the LHS has more than one column (#9918) @skirui-source
📖 Documentation
- Clarify append deprecation notice. (#10930) @bdice
- Use full name of GPUDirect Storage SDK in docs (#10904) @vuule
- Update Dask + Pandas to Dask + cuDF path (#10897) @miguelusque
- Add missing documentation in cudf/types.hpp (#10895) @karthikeyann
- Add strong index iterator docs. (#10888) @bdice
- spell check fixes (#10865) @karthikeyann
- Add missing documentation in scalar/ headers (#10861) @karthikeyann
- Remove typo in ngram documentation (#10859) @miguelusque
- fix doxygen warnings (#10842) @karthikeyann
- Add a library_design.md file documenting the core Python data structures and their relationship (#10817) @vyasr
- Add NumPy to intersphinx references. (#10809) @bdice
- Add a section to the docs that compares cuDF with Pandas (#10796) @shwina
- Mention 2 cpp-reviewer requirement in pull request template (#10768) @davidwendt
- Enable pydocstyle for all packages. (#10759) @bdice
- Enable pydocstyle rules involving quotes (#10748) @vyasr
- Revise 10 minutes notebook. (#10738) @bdice
- Reorganize cuDF Python docs (#10691) @shwina
- Fix sphinx/jupyter heading issue in UDF notebook (#10690) @brandon-b-miller
- Migrated user guide notebooks to MyST-NB and added sphinx extension (#10685) @mmccarty
- add data generation to benchmark documentation (#10677) @karthikeyann
- Fix some docs build warnings (#10674) @galipremsagar
- Update UDF notebook in User Guide. (#10668) @bdice
- Improve User Guide docs (#10663) @bdice
- Fix some docstrings formatting (#10660) @galipremsagar
- Remove implementation details from `apply` docstrings (#10651) @brandon-b-miller
- Revise CONTRIBUTING.md (#10644) @bdice
- Add missing APIs to documentation. (#10643) @bdice
- Use cudf.read_json as documented API name. (#10640) @bdice
- Fix docstring section headings. (#10639) @bdice
- Document cudf.read_text and cudf.read_avro. (#10638) @bdice
- Fix type-o in docstring for json_reader_options (#10627) @dagardner-nv
- Update guide to UDFs with notes about `Series.applymap` deprecation and related changes (#10607) @brandon-b-miller
- Fix doxygen Modules page for cudf::lists::sequences (#10561) @davidwendt
- Add Replace Backreferences section to Regex Features page (#10560) @davidwendt
- Introduce deprecation policy to developer guide. (#10252) @vyasr
🚀 New Features
- Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
- Handle nested types in cudf::concatenate_rows() (#10890) @nvdbaranec
- Strong index types for equality comparator (#10883) @ttnghia
- Add parameters to control page size in Parquet writer (#10882) @etseidl
- Support for Zstandard decompression in ORC reader (#10873) @vuule
- Use pre-built nvcomp 2.3 binaries by default (#10851) @robertmaynard
- Support for Zstandard decompression in Parquet reader (#10847) @vuule
- Add JNI support for apply_boolean_mask (#10812) @res-life
- Segmented Min/Max for Fixed Point Types (#10794) @isVoid
- Return per-file metadata from readers (#10782) @vuule
- Segmented `apply_boolean_mask` for `LIST` columns (#10773) @mythrocks
- Update `groupby::hash` to use new row operators for keys (#10770) @PointKernel
- Support purging non-empty null elements from LIST/STRING columns (#10701) @mythrocks
- Add `detail::hash_join` (#10695) @PointKernel
- Persist string statistics data across multiple calls to orc chunked write (#10694) @hyperbolic2346
- Add `.list.astype()` to cast list leaves to specified dtype (#10693) @shwina
- JNI: Add generateListOffsets API (#10683) @sperlingxx
- Support `args` in groupby apply (#10682) @brandon-b-miller
- Enable segmented_gather in Java package (#10669) @sperlingxx
- Add row hasher with nested column support (#10641) @devavret
- Add support for numeric_only in DataFrame._reduce (#10629) @martinfalisse
- First step toward statistics in ORC files with chunked writes (#10567) @hyperbolic2346
- Add support for struct columns to the random table generator (#10566) @vuule
- Enable passing a sequence for the `index` argument to `.list.get()` (#10564) @shwina
- Add python bindings for cudf::list::index_of (#10549) @ChrisJar
- Add default= kwarg to .list.get() accessor method (#10547) @shwina
- Add `cudf.DataFrame.applymap` (#10542) @brandon-b-miller
- Support nvComp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
- Add column field ID control in parquet writer (#10504) @PointKernel
- Deprecate `Series.applymap` (#10497) @brandon-b-miller
- Add option to drop cache in cuIO benchmarks (#10488) @vuule
- move...
v22.04.00
🚨 Breaking Changes
- Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
- Refactor stream compaction APIs (#10370) @PointKernel
- Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
- Avoid `decimal` type narrowing for decimal binops (#10299) @galipremsagar
- Rewrites `sample` API (#10262) @isVoid
- Remove probe-time null equality parameters in `cudf::hash_join` (#10260) @PointKernel
- Enable proper `Index` round-tripping in `orc` reader and writer (#10170) @galipremsagar
- Add JNI for `strings::split_re` and `strings::split_record_re` (#10139) @ttnghia
- Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
- Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
- Remove deprecated code (#10124) @vyasr
- Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
- Optimize compaction operations (#10030) @PointKernel
- Remove deprecated method Series.set_index. (#9945) @bdice
- Add cudf::strings::findall_record API (#9911) @davidwendt
- Upgrade `arrow` & `pyarrow` to `6.0.1` (#9686) @galipremsagar
🐛 Bug Fixes
- Fix an issue with tdigest merge aggregations. (#10506) @nvdbaranec
- Batch of fixes for index overflows in grid stride loops. (#10448) @nvdbaranec
- Update dask_cudf imports to be compatible with latest dask (#10442) @rlratzel
- Fix for integer overflow in contiguous-split (#10437) @jbrennan333
- Fix has_null predicate for drop_list_duplicates on nested structs (#10436) @sperlingxx
- Fix empty reduce with List output and non-List input (#10435) @sperlingxx
- Fix `list` and `struct` meta generation issue in `dask-cudf` (#10434) @galipremsagar
- Fix error in `cudf.to_numeric` when a `bool` input is passed (#10431) @galipremsagar
- Support cupy array in `quantile` input (#10429) @galipremsagar
- Fix benchmarks to work with new aggregation types (#10428) @davidwendt
- Fix cudf::shift to handle offset greater than column size (#10414) @davidwendt
- Fix lifespan of the temporary directory that holds cuFile configuration file (#10403) @vuule
- Fix error thrown in compiled-binaryop benchmark (#10398) @davidwendt
- Limiting async allocator using alignment of 512 (#10395) @rongou
- Include <optional> in multibyte split. (#10385) @bdice
- Fix issue with column and scalar re-assignment (#10377) @galipremsagar
- Fix floating point data generation in benchmarks (#10372) @vuule
- Avoid overflow in fused_concatenate_kernel output_index (#10344) @abellina
- Remove is_relationally_comparable for table device views (#10342) @davidwendt
- Fix debug compile error in device_span to column_view conversion (#10331) @davidwendt
- Add Pascal support to JCUDF transcode (row_conversion) (#10329) @mythrocks
- Fix `std::bad_alloc` exception due to JIT reserving a huge buffer (#10317) @ttnghia
- Fixes up the overflowed fixed-point round on nullable column (#10316) @sperlingxx
- Fix DataFrame slicing issues for empty cases (#10310) @brandon-b-miller
- Fix documentation issues (#10307) @ajschmidt8
- Allow Java bindings to use default decimal precisions when writing columns (#10276) @sperlingxx
- Fix incorrect slicing of GDS read/write calls (#10274) @vuule
- Fix out-of-memory error in compiled-binaryop benchmark (#10269) @davidwendt
- Add tests of reflected ufuncs and fix behavior of logical reflected ufuncs (#10261) @vyasr
- Remove probe-time null equality parameters in `cudf::hash_join` (#10260) @PointKernel
- Fix out-of-memory error in UrlDecode benchmark (#10258) @davidwendt
- Fix groupby reductions that perform operations on source type instead of target type (#10250) @ttnghia
- Fix small leak in explode (#10245) @revans2
- Yet another small JNI memory leak (#10238) @revans2
- Fix regex octal parsing to limit to 3 characters (#10233) @davidwendt
- Fix string to decimal128 conversion handling large exponents (#10231) @davidwendt
- Fix JNI leak on copy to device (#10229) @revans2
- Fix the data generator element size for decimal types (#10225) @vuule
- Fix `decimal` metadata in parquet writer (#10224) @galipremsagar
- Fix strings handling of hex in regex pattern (#10220) @davidwendt
- Fix docs builds (#10216) @ajschmidt8
- Fix a leftover _has_nulls change from Nullate (#10211) @devavret
- Fix bitmask of the output for JNI of `lists::drop_list_duplicates` (#10210) @ttnghia
- Fix compile error in `binaryop/compiled/util.cpp` (#10209) @ttnghia
- Skip ORC and Parquet readers' benchmark cases that are not currently supported (#10194) @vuule
- Fix JNI leak of a cudf::column_view native class. (#10171) @revans2
- Enable proper `Index` round-tripping in `orc` reader and writer (#10170) @galipremsagar
- Convert Column Name to String Before Using Struct Column Factory (#10156) @isVoid
- Preserve the correct `ListDtype` while creating an identical empty column (#10151) @galipremsagar
- benchmark fixture - static object pointer fix (#10145) @karthikeyann
- Fix UDF Caching (#10133) @brandon-b-miller
- Raise duplicate column error in `DataFrame.rename` (#10120) @galipremsagar
- Fix flaky memory usage test by guaranteeing array size. (#10114) @vyasr
- Encode values from python callback for C++ (#10103) @jdye64
- Add check for regex instructions causing an infinite-loop (#10095) @davidwendt
- Remove metadata singleton from nvtext normalizer (#10090) @davidwendt
- Column equality testing fixes (#10011) @brandon-b-miller
- Pin libcudf runtime dependency for cudf / libcudf-kafka nightlies (#9847) @charlesbluca
📖 Documentation
- Fix documentation for DataFrame.corr and Series.corr. (#10493) @bdice
- Add `cut` to API docs (#10479) @shwina
- Remove documentation for methods removed in #10124. (#10366) @bdice
- Fix documentation issues (#10306) @ajschmidt8
- Fix `fixed_point` binary operation documentation (#10198) @codereport
- Remove cleaned up methods from docs (#10189) @galipremsagar
- Update developer guide to recommend no default stream parameter. (#10136) @bdice
- Update benchmarking guide to use NVBench. (#10093) @bdice
🚀 New Features
- Add StringIO support to read_text (#10465) @cwharris
- Add support for tdigest and merge_tdigest aggregations through cudf::reduce (#10433) @nvdbaranec
- JNI support for Collect Ops in Reduction (#10427) @sperlingxx
- Enable read_text with dask_cudf using byte_range (#10407) @ChrisJar
- Add `cudf::stable_sort_by_key` (#10387) @PointKernel
- Implement `maps_column_view` abstraction over `LIST<STRUCT<K,V>>` (#10380) @mythrocks
- Support Java bindings for Avro reader (#10373) @HaoYang670
- Refactor stream compaction APIs (#10370) @PointKernel
- Support collect aggregations in reduction (#10353) @sperlingxx
- Refactor array_ufunc for Index and unify across all classes (#10346) @vyasr
- Add JNI for extract_list_element with index column (#10341) @firestarman
- Support `min` and `max` operations for structs in rolling window (#10332) @ttnghia
- Add device create_sequence_table for benchmarks (#10300) @karthikeyann
- Enable numpy ufuncs for DataFrame (#10287) @vyasr
- move input generation for json benchmark to device (#10281) @karthikeyann
- move input generation for type dispatcher benchmark to device (#10280) @karthikeyann
- move input generation for copy benchmark to device (#10279) @karthikeyann
- generate url decode benchmark input in device (#10278) @karthikeyann
- device input generation in join bench (#10277) @karthikeyann
- Add nvtext::byte_pair_encoding API (#10270) @davidwendt
- Prevent internal usage of expensive APIs (#10263) @vyasr
- Column to JCUDF row for tables with strings (#10235) @hyperbolic2346
- Support `percent_rank()` aggregation (#10227) @mythrocks
- Refactor Series.array_ufunc (#10217) @vyasr
- Reduce pytest runtime (#10203) @brandon-b-miller
- Add regex flags parameter to python cudf strings split (#10185) @davidwendt
- Support for `MOD`, `PMOD` and `PYMOD` for `decimal32/64/128` (#10179) @codereport
- Adding string row size iterator for row to column and column to row conversion (#10157) @hyperbolic2346
- Add file size counter to cuIO benchmarks (#10154) @vuule
- byte_range support for multibyte_split/read_text (#10150) @cwharris
- Add JNI for `strings::split_re` and `strings::split_record_re` (#10139) @ttnghia
- Add `maxSplit` parameter to Java binding for `strings:split` (#10137) @ttnghia
- Add libcudf strings split API that accepts regex pattern (#10128) @davidwendt
- generate benchmark input in device (#10109) @karthikeyann
- Avoid `nan_as_null` op if `nan_count` is 0 (#10082) @galipremsagar
- Add Dataframe and Index nunique (#10077) @martinfalisse
- Support nanosecond timestamps in parquet (#10063) @PointKernel
- Java bindings for mixed semi and anti joins (#10040) @jlowe
- Implement mixed equality/conditional semi/anti joins (#10037) @vyasr
- Optimize compaction operations (#10030) @PointKernel
- Support `args=` in `Series.apply` (#9982) @brandon-b-miller
- Add cudf::strings::findall_record API (#9911) @davidwendt
- Add covariance for sort groupby (python) (#9889) @mayankanand007
- Implement DataFrame diff() (#9817) @skirui-source
- Implement DataFrame pct_change (#9805) @skirui-source
- Support segmented reductions and null mask reductions (#9621) @isVoid
- Add 'spearman' correlation method for `dataframe.corr` and `series.corr` (#7141) @dominicshanshan
🛠️ Improvements
- Add `scipy` skip for a test (#10502) @galipremsagar
- Temporarily disable new `ops-bot` functionality (#10496) @ajschmidt8
- Include <cstddef> to fix compilation of parquet reader on GCC 11. (#10483) @bdice
- Pin `dask` and `distributed` (#10481) @galipremsagar
- MD5 refactoring. (#10445) @bdice
- Remove or split up Frame methods that use the index (#10439) @vyasr
- Centralization of tdigest aggregation code. (#10422) @nvdbaranec
- Simplify column binary operations (#10421) @vyasr
- Add `.github/ops-bot.yaml` config file (#10420) @ajschmidt8
- Use list of columns for methods in `Groupby.pyx` (#10419) @isVoid
- Remov...
v22.02.00
🚨 Breaking Changes
- ORC writer API changes for granular statistics (#10058) @mythrocks
- `decimal128` Support for `to/from_arrow` (#9986) @codereport
- Remove deprecated method `one_hot_encoding` (#9977) @isVoid
- Remove str.subword_tokenize (#9968) @VibhuJawa
- Remove deprecated `method` parameter from `merge` and `join`. (#9944) @bdice
- Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
- Remove deprecated method Series.hash_encode. (#9942) @bdice
- Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
- Introduce `nan_as_null` parameter for `cudf.Index` (#9893) @galipremsagar
- Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
- Break tie for `top` categorical columns in `Series.describe` (#9867) @isVoid
- Add partitioning support in parquet writer (#9810) @devavret
- Move `drop_duplicates`, `drop_na`, `_gather`, `take` to IndexFrame and create their `_base_index` counterparts (#9807) @isVoid
- Raise temporary error for `decimal128` types in parquet reader (#9804) @galipremsagar
- Change default `dtype` of all nulls column from `float` to `object` (#9803) @galipremsagar
- Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
- Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
- Add decimal128 support to Parquet reader and writer (#9765) @vuule
- Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
- Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
- Match pandas scalar result types in reductions (#9717) @brandon-b-miller
- Add parameters to control row group size in Parquet writer (#9677) @vuule
- Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
- Add support for `decimal128` in cudf python (#9533) @galipremsagar
- Implement `lists::index_of()` to find positions in list rows (#9510) @mythrocks
- Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346
🐛 Bug Fixes
- Add check for negative stripe index in ORC reader (#10074) @vuule
- Update Java tests to expect DECIMAL128 from Arrow (#10073) @jlowe
- Avoid index materialization when `DataFrame` is created with un-named `Series` objects (#10071) @galipremsagar
- fix gcc 11 compilation errors (#10067) @rongou
- Fix `columns` ordering issue in parquet reader (#10066) @galipremsagar
- Fix dataframe setitem with `ndarray` types (#10056) @galipremsagar
- Remove implicit copy due to conversion from cudf::size_type and size_t (#10045) @robertmaynard
- Include <optional> in headers that use std::optional (#10044) @robertmaynard
- Fix repr and concat of `StructColumn` (#10042) @galipremsagar
- Include row group level stats when writing ORC files (#10041) @vuule
- build.sh respects the `--build_metrics` and `--incl_cache_stats` flags (#10035) @robertmaynard
- Fix memory leaks in JNI native code. (#10029) @mythrocks
- Update JNI to use new arena mr constructor (#10027) @rongou
- Fix null check when comparing structs in `arg_min` operation of reduction/groupby (#10026) @ttnghia
- Wrap CI script shell variables in quotes to fix local testing. (#10018) @bdice
- cudftestutil no longer propagates compiler flags to external users (#10017) @robertmaynard
- Remove `CUDA_DEVICE_CALLABLE` macro usage (#10015) @hyperbolic2346
- Add missing list filling header in meta.yaml (#10007) @devavret
- Fix `conda` recipes for `custreamz` & `cudf_kafka` (#10003) @ajschmidt8
- Fix matching regex word-boundary (\b) in strings replace (#9997) @davidwendt
- Fix null check when comparing structs in `min` and `max` reduction/groupby operations (#9994) @ttnghia
- Fix octal pattern matching in regex string (#9993) @davidwendt
- `decimal128` Support for `to/from_arrow` (#9986) @codereport
- Fix groupby shift/diff/fill after selecting from a `GroupBy` (#9984) @shwina
- Fix the overflow problem of decimal rescale (#9966) @sperlingxx
- Use default value for decimal precision in parquet writer when not specified (#9963) @devavret
- Fix cudf java build error. (#9958) @firestarman
- Use gpuci_mamba_retry to install local artifacts. (#9951) @bdice
- Fix regression HostColumnVectorCore requiring native libs (#9948) @jlowe
- Rename aggregate_metadata in writer to fix name collision (#9938) @devavret
- Fixed issue with percentile_approx where output tdigests could have uninitialized data at the end. (#9931) @nvdbaranec
- Resolve racecheck errors in ORC kernels (#9916) @vuule
- Fix the java build after parquet partitioning support (#9908) @revans2
- Fix compilation of benchmark for parquet writer. (#9905) @bdice
- Fix a memcheck error in ORC writer (#9896) @vuule
- Introduce `nan_as_null` parameter for `cudf.Index` (#9893) @galipremsagar
- Fix fallback to sort aggregation for grouping only hash aggregate (#9891) @abellina
- Add zlib to cudfjni link when using static libcudf library dependency (#9890) @jlowe
- TimedeltaIndex constructor raises an AttributeError. (#9884) @skirui-source
- Fix cudf.Scalar string datetime construction (#9875) @brandon-b-miller
- Load libcufile.so with RTLD_NODELETE flag (#9872) @vuule
- Break tie for `top` categorical columns in `Series.describe` (#9867) @isVoid
- Fix null handling for structs `min` and `arg_min` in groupby, groupby scan, reduction, and inclusive_scan (#9864) @ttnghia
- Add one-level list encoding support in parquet reader (#9848) @PointKernel
- Fix an out-of-bounds read in validity copying in contiguous_split. (#9842) @nvdbaranec
- Fix join of MultiIndex to Index with one column and overlapping name. (#9830) @vyasr
- Fix caching in `Series.applymap` (#9821) @brandon-b-miller
- Enforce boolean `ascending` for dask-cudf `sort_values` (#9814) @charlesbluca
- Fix ORC writer crash with empty input columns (#9808) @vuule
- Change default `dtype` of all nulls column from `float` to `object` (#9803) @galipremsagar
- Load native dependencies when Java ColumnView is loaded (#9800) @jlowe
- Fix dtype-argument bug in dask_cudf read_csv (#9796) @rjzamora
- Fix overflow for min calculation in strings::from_timestamps (#9793) @revans2
- Fix memory error due to lambda return type deduction limitation (#9778) @karthikeyann
- Revert regex $/EOL end-of-string new-line special case handling (#9774) @davidwendt
- Fix missing streams (#9767) @karthikeyann
- Fix make_empty_scalar_like on list_type (#9759) @sperlingxx
- Update cmake and conda to 22.02 (#9746) @devavret
- Fix out-of-bounds memory write in decimal128-to-string conversion (#9740) @davidwendt
- Match pandas scalar result types in reductions (#9717) @brandon-b-miller
- Fix regex non-multiline EOL/$ matching strings ending with a new-line (#9715) @davidwendt
- Fixed build by adding more checks for int8, int16 (#9707) @razajafri
- Fix `null` handling when `boolean` dtype is passed (#9691) @galipremsagar
- Fix stream usage in `segmented_gather()` (#9679) @mythrocks
📖 Documentation
- Update `decimal` dtypes related docs entries (#10072) @galipremsagar
- Fix regex doc describing hexadecimal escape characters (#10009) @davidwendt
- Fix cudf compilation instructions. (#9956) @esoha-nvidia
- Fix see also links for IO APIs (#9895) @galipremsagar
- Fix build instructions for libcudf doxygen (#9837) @davidwendt
- Fix some doxygen warnings and add missing documentation (#9770) @karthikeyann
- update cuda version in local build (#9736) @karthikeyann
- Fix doxygen for enum types in libcudf (#9724) @davidwendt
- Spell check fixes (#9682) @karthikeyann
- Fix links in C++ Developer Guide. (#9675) @bdice
🚀 New Features
- Remove libcudacxx patch needed for nvcc 11.4 (#10057) @robertmaynard
- Allow CuPy 10 (#10048) @jakirkham
- Add in support for NULL_LOGICAL_AND and NULL_LOGICAL_OR binops (#10016) @revans2
- Add `groupby.transform` (only support for aggregations) (#10005) @shwina
- Add partitioning support to Parquet chunked writer (#10000) @devavret
- Add jni for sequences (#9972) @wbo4958
- Java bindings for mixed left, inner, and full joins (#9941) @jlowe
- Java bindings for JSON reader support (#9940) @wbo4958
- Enable transpose for string columns in cudf python (#9937) @galipremsagar
- Support structs for `cudf::contains` with column/scalar input (#9929) @ttnghia
- Implement mixed equality/conditional joins (#9917) @vyasr
- Add cudf::strings::extract_all API (#9909) @davidwendt
- Implement JNI for `cudf::scatter` APIs (#9903) @ttnghia
- JNI: Function to copy and set validity from bool column. (#9901) @mythrocks
- Add dictionary support to cudf::copy_if_else (#9887) @davidwendt
- add run_benchmarks target for running benchmarks with json output (#9879) @karthikeyann
- Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
- Add_suffix and add_prefix for DataFrames and Series (#9846) @mayankanand007
- Add JNI for `cudf::drop_duplicates` (#9841) @ttnghia
- Implement per-list sequence (#9839) @ttnghia
- adding `series.transpose` (#9835) @mayankanand007
- Adding support for `Series.autocorr` (#9833) @mayankanand007
- Support round operation on datetime64 datatypes (#9820) @mayankanand007
- Add partitioning support in parquet writer (#9810) @devavret
- Raise temporary error for `decimal128` types in parquet reader (#9804) @galipremsagar
- Add decimal128 support to Parquet reader and writer (#9765) @vuule
- Optimize `groupby::scan` (#9754) @PointKernel
- Add sample JNI API (#9728) @res-life
- Support `min` and `max` in inclusive scan for structs (#9725) @ttnghia
- Add `first` and `last` method to `IndexedFrame` (#9710) @isVoid
- Support `min` and `max` reduction for structs (#9697) @ttnghia
- Add parameters to control row group size in Parquet writer (#9677) @vuule
- Run compute-sanitizer in nightly build (#9641) @karthikeyann
- Implement Series.datetime.floor (#9571) @skirui-source
- ceil/floor for `DatetimeIndex` (#9554) @mayankanand007
- Add support for `decimal128` in cudf python (#9533) @galipremsagar
- ...