v22.06.00
🚨 Breaking Changes
- Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
- Rename `sliced_child` to `get_sliced_child`. (#10885) @bdice
- Add parameters to control page size in Parquet writer (#10882) @etseidl
- Make cudf::test::expect_columns_equal() to fail when comparing unsanitary lists. (#10880) @nvdbaranec
- Cleanup regex compiler fixed quantifiers source (#10843) @davidwendt
- Refactor `cudf::contains`, renaming and switching parameters role (#10802) @ttnghia
- Generic serialization of all column types (#10784) @wence-
- Return per-file metadata from readers (#10782) @vuule
- HostColumnVectoreCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
- Update `groupby::hash` to use new row operators for keys (#10770) @PointKernel
- update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
- Rename CUDA_TRY macro to CUDF_CUDA_TRY, rename CHECK_CUDA macro to CUDF_CHECK_CUDA. (#10589) @bdice
- Upgrade `cudf` to support `pandas` 1.4.x versions (#10584) @galipremsagar
- Move binop methods from Frame to IndexedFrame and standardize the docstring (#10576) @vyasr
- Add default= kwarg to .list.get() accessor method (#10547) @shwina
- Remove deprecated `decimal_cols_as_float` in the ORC reader (#10515) @vuule
- Support nvComp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
- Fix findall_record to return empty list for no matches (#10491) @davidwendt
- Namespace/Docstring Fixes for Reduction (#10471) @isVoid
- Additional refactoring of hash functions (#10462) @bdice
- Fix default value of str.split expand parameter. (#10457) @bdice
- Remove deprecated code. (#10450) @vyasr
🐛 Bug Fixes
- Fix single column `MultiIndex` issue in `sort_index` (#10957) @galipremsagar
- Make SerializedTableHeader(numRows) public (#10949) @gerashegalov
- Fix `gcc_linux` version pinning in dev environment (#10943) @galipremsagar
- Fix an issue with reading raw string in `cudf.read_json` (#10924) @galipremsagar
- Make cudf::test::expect_columns_equal() to fail when comparing unsanitary lists. (#10880) @nvdbaranec
- Fix segmented_reduce on empty column with non-empty offsets (#10876) @davidwendt
- Fix dask-cudf groupby handling when grouping by all columns (#10866) @charlesbluca
- Fix a bug in `distinct`: using nested nulls logic (#10848) @PointKernel
- Fix constness / references in weak ordering operator() signatures. (#10846) @bdice
- Suppress sizeof-array-div warnings in thrust found by gcc-11 (#10840) @robertmaynard
- Add handling for string by-columns in dask-cudf groupby (#10830) @charlesbluca
- Fix compile warning in search.cu (#10827) @davidwendt
- Fix element access const correctness in `hostdevice_vector` (#10804) @vuule
- Update `cuco` git tag (#10788) @PointKernel
- HostColumnVectoreCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
- Fixing deprecation warnings in test_orc.py (#10772) @hyperbolic2346
- Enable writing to `s3` storage in chunked parquet writer (#10769) @galipremsagar
- Fix construction of nested structs with EMPTY child (#10761) @shwina
- Fix replace error when regex has only zero match quantifiers (#10760) @davidwendt
- Fix an issue with one_level_list schemas in parquet reader. (#10750) @nvdbaranec
- update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
- Fix `cupy` function in notebook (#10737) @ajschmidt8
- Fix `fillna` to retain `columns` when it is `MultiIndex` (#10729) @galipremsagar
- Fix scatter for all-empty-string column case (#10724) @davidwendt
- Retain series name in `Series.apply` (#10716) @brandon-b-miller
- Correct build dir `cudf-config` dependency issues for static builds (#10704) @robertmaynard
- Fix list of testing requirements in setup.py. (#10678) @bdice
- Fix rounding to zero error in stod on very small float numbers (#10672) @davidwendt
- cuco isn't a cudf dependency when we are built shared (#10662) @robertmaynard
- Fix to_timestamps to support Z for %z format specifier (#10617) @davidwendt
- Verify compression type in Parquet reader (#10610) @vuule
- Fix struct row comparator's exception on empty structs (#10604) @sperlingxx
- Fix strings strip() to accept only str Scalar for to_strip parameter (#10597) @davidwendt
- Fix has_atomic_support check in can_use_hash_groupby() (#10588) @jbrennan333
- Revert Thrust 1.16 to Thrust 1.15 (#10586) @bdice
- Fix missing RMM_STATIC_CUDART define when compiling JNI with static CUDA runtime (#10585) @jlowe
- pin more cmake versions (#10570) @robertmaynard
- Re-enable Build Metrics Report (#10562) @davidwendt
- Remove statically linked CUDA runtime check in Java build (#10532) @jlowe
- Fix temp data cleanup in `test_text.py` (#10524) @brandon-b-miller
- Update pre-commit to run black 22.3.0 (#10523) @vyasr
- Remove deprecated `decimal_cols_as_float` in the ORC reader (#10515) @vuule
- Fix findall_record to return empty list for no matches (#10491) @davidwendt
- Allow users to specify data types for a subset of columns in `read_csv` (#10484) @vuule
- Fix default value of str.split expand parameter. (#10457) @bdice
- Improve coverage of dask-cudf's groupby aggregation, add tests for `dropna` support (#10449) @charlesbluca
- Allow string aggs for `dask_cudf.CudfDataFrameGroupBy.aggregate` (#10222) @charlesbluca
- In-place updates with loc or iloc don't work correctly when the LHS has more than one column (#9918) @skirui-source
📖 Documentation
- Clarify append deprecation notice. (#10930) @bdice
- Use full name of GPUDirect Storage SDK in docs (#10904) @vuule
- Update Dask + Pandas to Dask + cuDF path (#10897) @miguelusque
- Add missing documentation in cudf/types.hpp (#10895) @karthikeyann
- Add strong index iterator docs. (#10888) @bdice
- spell check fixes (#10865) @karthikeyann
- Add missing documentation in scalar/ headers (#10861) @karthikeyann
- Remove typo in ngram documentation (#10859) @miguelusque
- fix doxygen warnings (#10842) @karthikeyann
- Add a library_design.md file documenting the core Python data structures and their relationship (#10817) @vyasr
- Add NumPy to intersphinx references. (#10809) @bdice
- Add a section to the docs that compares cuDF with Pandas (#10796) @shwina
- Mention 2 cpp-reviewer requirement in pull request template (#10768) @davidwendt
- Enable pydocstyle for all packages. (#10759) @bdice
- Enable pydocstyle rules involving quotes (#10748) @vyasr
- Revise 10 minutes notebook. (#10738) @bdice
- Reorganize cuDF Python docs (#10691) @shwina
- Fix sphinx/jupyter heading issue in UDF notebook (#10690) @brandon-b-miller
- Migrated user guide notebooks to MyST-NB and added sphinx extension (#10685) @mmccarty
- add data generation to benchmark documentation (#10677) @karthikeyann
- Fix some docs build warnings (#10674) @galipremsagar
- Update UDF notebook in User Guide. (#10668) @bdice
- Improve User Guide docs (#10663) @bdice
- Fix some docstrings formatting (#10660) @galipremsagar
- Remove implementation details from `apply` docstrings (#10651) @brandon-b-miller
- Revise CONTRIBUTING.md (#10644) @bdice
- Add missing APIs to documentation. (#10643) @bdice
- Use cudf.read_json as documented API name. (#10640) @bdice
- Fix docstring section headings. (#10639) @bdice
- Document cudf.read_text and cudf.read_avro. (#10638) @bdice
- Fix typo in docstring for json_reader_options (#10627) @dagardner-nv
- Update guide to UDFs with notes about `Series.applymap` deprecation and related changes (#10607) @brandon-b-miller
- Fix doxygen Modules page for cudf::lists::sequences (#10561) @davidwendt
- Add Replace Backreferences section to Regex Features page (#10560) @davidwendt
- Introduce deprecation policy to developer guide. (#10252) @vyasr
🚀 New Features
- Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
- Handle nested types in cudf::concatenate_rows() (#10890) @nvdbaranec
- Strong index types for equality comparator (#10883) @ttnghia
- Add parameters to control page size in Parquet writer (#10882) @etseidl
- Support for Zstandard decompression in ORC reader (#10873) @vuule
- Use pre-built nvcomp 2.3 binaries by default (#10851) @robertmaynard
- Support for Zstandard decompression in Parquet reader (#10847) @vuule
- Add JNI support for apply_boolean_mask (#10812) @res-life
- Segmented Min/Max for Fixed Point Types (#10794) @isVoid
- Return per-file metadata from readers (#10782) @vuule
- Segmented `apply_boolean_mask` for `LIST` columns (#10773) @mythrocks
- Update `groupby::hash` to use new row operators for keys (#10770) @PointKernel
- Support purging non-empty null elements from LIST/STRING columns (#10701) @mythrocks
- Add `detail::hash_join` (#10695) @PointKernel
- Persist string statistics data across multiple calls to orc chunked write (#10694) @hyperbolic2346
- Add `.list.astype()` to cast list leaves to specified dtype (#10693) @shwina
- JNI: Add generateListOffsets API (#10683) @sperlingxx
- Support `args` in groupby apply (#10682) @brandon-b-miller
- Enable segmented_gather in Java package (#10669) @sperlingxx
- Add row hasher with nested column support (#10641) @devavret
- Add support for numeric_only in DataFrame._reduce (#10629) @martinfalisse
- First step toward statistics in ORC files with chunked writes (#10567) @hyperbolic2346
- Add support for struct columns to the random table generator (#10566) @vuule
- Enable passing a sequence for the `index` argument to `.list.get()` (#10564) @shwina
- Add python bindings for cudf::list::index_of (#10549) @ChrisJar
- Add default= kwarg to .list.get() accessor method (#10547) @shwina
- Add `cudf.DataFrame.applymap` (#10542) @brandon-b-miller
- Support nvComp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
- Add column field ID control in parquet writer (#10504) @PointKernel
- Deprecate `Series.applymap` (#10497) @brandon-b-miller
- Add option to drop cache in cuIO benchmarks (#10488) @vuule
- move benchmark input generation in device in reduction nvbench (#10486) @karthikeyann
- Support Segmented Min/Max Reduction on String Type (#10447) @isVoid
- List element Equality comparator (#10289) @devavret
- Implement all methods of groupby rank aggregation in libcudf, python (#9569) @karthikeyann
- Implement DataFrame.eval using libcudf ASTs (#8022) @vyasr
🛠️ Improvements
- Use `conda` compilers in env file (#10915) @galipremsagar
- Remove C style artifacts in cuIO (#10886) @vuule
- Rename `sliced_child` to `get_sliced_child`. (#10885) @bdice
- Replace defaulted stream value for libcudf APIs that use NVCOMP (#10877) @jbrennan333
- Add more unit tests for `cudf::distinct` for nested types with sliced input (#10860) @ttnghia
- Changing `list_view.cuh` to `list_view.hpp` (#10854) @ttnghia
- More error checking in `from_dlpack` (#10850) @wence-
- Cleanup regex compiler fixed quantifiers source (#10843) @davidwendt
- Adds the JNI call for Cuda.deviceSynchronize (#10839) @abellina
- Add missing cuda-python dependency to cudf (#10833) @bdice
- Change std::string parameters in cudf::strings APIs to std::string_view (#10832) @davidwendt
- Split up search.cu to improve compile time (#10831) @davidwendt
- Add tests for null scalar binaryops (#10828) @brandon-b-miller
- Cleanup regex compile optimize functions (#10825) @davidwendt
- Use `ThreadedMotoServer` instead of `subprocess` in spinning up `s3` server (#10822) @galipremsagar
- Import `NA` from `missing` rather than using `cudf.NA` everywhere (#10821) @brandon-b-miller
- Refactor regex builtin character-class identifiers (#10814) @davidwendt
- Change pattern parameter for regex APIs from std::string to std::string_view (#10810) @davidwendt
- Make the JNI API to get list offsets as a view public. (#10807) @revans2
- Add cudf JNI docker build github action (#10806) @pxLi
- Removed `mr` parameter from inplace bitmask operations (#10805) @AtlantaPepsi
- Refactor `cudf::contains`, renaming and switching parameters role (#10802) @ttnghia
- Handle closed property in IntervalDtype.from_pandas (#10798) @wence-
- Return weak orderings from `device_row_comparator`. (#10793) @rwlee
- Rework `Scalar` imports (#10791) @brandon-b-miller
- Enable ccache for cudfjni build in Docker (#10790) @gerashegalov
- Generic serialization of all column types (#10784) @wence-
- simplifying skiprows test in test_orc.py (#10783) @hyperbolic2346
- Use column_views instead of column_device_views in binary operations. (#10780) @bdice
- Add struct utility functions. (#10776) @bdice
- Add multiple rows to subword tokenizer benchmark (#10767) @davidwendt
- Refactor host decompression in ORC reader (#10764) @vuule
- Flush output streams before creating a process to drop caches (#10762) @vuule
- Refactor binaryop/compiled/util.cpp (#10756) @bdice
- Use warp per string for long strings in cudf::strings::contains() (#10739) @davidwendt
- Use generator expressions in any/all functions. (#10736) @bdice
- Use canonical "magic methods" (replace `x.__repr__()` with `repr(x)`). (#10735) @bdice
- Improve use of isinstance. (#10734) @bdice
- Rename tests from multiIndex to multiindex. (#10732) @bdice
- Two-table comparators with strong index types (#10730) @bdice
- Replace std::make_pair with std::pair (C++17 CTAD) (#10727) @karthikeyann
- Use structured bindings instead of std::tie (#10726) @karthikeyann
- Missing `f` prefix on f-strings fix (#10721) @code-review-doctor
- Add `max_file_size` parameter to chunked parquet dataset writer (#10718) @galipremsagar
- Deprecate `merge_sorted`, change dask cudf usage to internal method (#10713) @isVoid
- Prepare dask_cudf test_parquet.py for upcoming API changes (#10709) @rjzamora
- Remove or simplify various utility functions (#10705) @vyasr
- Allow building arrow with parquet and not python (#10702) @revans2
- Partial cuIO GPU decompression refactor (#10699) @vuule
- Cython API refactor: `merge.pyx` (#10698) @isVoid
- Fix random string data length to become variable (#10697) @galipremsagar
- Add bindings for index_of with column search key (#10696) @ChrisJar
- Deprecate index merging (#10689) @vyasr
- Remove cudf::strings::string namespace (#10684) @davidwendt
- Standardize imports. (#10680) @bdice
- Standardize usage of collections.abc. (#10679) @bdice
- Cython API Refactor: `transpose.pyx`, `sort.pyx` (#10675) @isVoid
- Add device_memory_resource parameter to create_string_vector_from_column (#10673) @davidwendt
- Split up mixed-join kernels source files (#10671) @davidwendt
- Use `std::filesystem` for temporary directory location and deletion (#10664) @vuule
- cleanup benchmark includes (#10661) @karthikeyann
- Use upstream clang-format pre-commit hook. (#10659) @bdice
- Clean up C++ includes to use <> instead of "". (#10658) @bdice
- Handle RuntimeError thrown by CUDA Python in `validate_setup` (#10653) @shwina
- Rework JNI CMake to leverage rapids_find_package (#10649) @jlowe
- Use conda to build python packages during GPU tests (#10648) @Ethyling
- Deprecate various functions that don't need to be defined for Index. (#10647) @vyasr
- Update pinning to allow newer CMake versions. (#10646) @vyasr
- Bump hadoop-common from 3.1.4 to 3.2.3 in /java (#10645) @dependabot[bot]
- Remove `concurrent_unordered_multimap`. (#10642) @bdice
- Improve parquet dictionary encoding (#10635) @PointKernel
- Improve cudf::cuda_error (#10630) @sperlingxx
- Add support for null and non-numeric types in Series.diff and DataFrame.diff (#10625) @Matt711
- Branch 22.06 merge 22.04 (#10624) @vyasr
- Unpin `dask` & `distributed` for development (#10623) @galipremsagar
- Slightly improve accuracy of stod in to_floats (#10622) @davidwendt
- Allow libcudfjni to be built as a static library (#10619) @jlowe
- Change stack-based regex state data to use global memory (#10600) @davidwendt
- Resolve Forward merging of `branch-22.04` into `branch-22.06` (#10598) @galipremsagar
- KvikIO as an alternative GDS backend (#10593) @madsbk
- Rename CUDA_TRY macro to CUDF_CUDA_TRY, rename CHECK_CUDA macro to CUDF_CHECK_CUDA. (#10589) @bdice
- Upgrade `cudf` to support `pandas` 1.4.x versions (#10584) @galipremsagar
- Refactor binary ops for timedelta and datetime columns (#10581) @vyasr
- Refactor cudf::strings::count_re API to use count_matches utility (#10580) @davidwendt
- Update `Programming Language :: Python` Versions to 3.8 & 3.9 (#10579) @madsbk
- Automate Java cudf jar build with statically linked dependencies (#10578) @gerashegalov
- Add patch for thrust-cub 1.16 to fix sort compile times (#10577) @davidwendt
- Move binop methods from Frame to IndexedFrame and standardize the docstring (#10576) @vyasr
- Cleanup libcudf strings regex classes (#10573) @davidwendt
- Simplify preprocessing of arguments for DataFrame binops (#10563) @vyasr
- Reduce kernel calls to build strings findall results (#10559) @davidwendt
- Forward-merge branch-22.04 to branch-22.06 (#10557) @bdice
- Update strings contains benchmark to measure varying match rates (#10555) @davidwendt
- JNI: throw CUDA errors more specifically (#10551) @sperlingxx
- Enable building static libs (#10545) @trxcllnt
- Remove pip requirements files. (#10543) @bdice
- Remove Click pinnings that are unnecessary after upgrading black. (#10541) @vyasr
- Refactor `memory_usage` to improve performance (#10537) @galipremsagar
- Adjust the valid range of group index for replace_with_backrefs (#10530) @sperlingxx
- add accidentally removed comment. (#10526) @vyasr
- Update conda environment. (#10525) @vyasr
- Remove ColumnBase.getitem (#10516) @vyasr
- Optimize `left_semi_join` by materializing the gather mask (#10511) @cheinger
- Define proper binary operation APIs for columns (#10509) @vyasr
- Upgrade `arrow-cpp` & `pyarrow` to `7.0.0` (#10503) @galipremsagar
- Update to Thrust 1.16 (#10489) @bdice
- Namespace/Docstring Fixes for Reduction (#10471) @isVoid
- Update cudfjni 22.06.0-SNAPSHOT (#10467) @pxLi
- Use Lists of Columns for Various Files (#10463) @isVoid
- Additional refactoring of hash functions (#10462) @bdice
- Fix Series.str.findall behavior for expand=False. (#10459) @bdice
- Remove deprecated code. (#10450) @vyasr
- Update cmake-format version. (#10440) @vyasr
- Consolidate C++ `conda` recipes and add `libcudf-tests` package (#10326) @ajschmidt8
- Use conda compilers (#10275) @Ethyling
- Add row bitmask as a `detail::hash_join` member (#10248) @PointKernel