Release v22.06.00 · rapidsai/cudf

🚨 Breaking Changes

Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
Rename sliced_child to get_sliced_child. (#10885) @bdice
Add parameters to control page size in Parquet writer (#10882) @etseidl
Make cudf::test::expect_columns_equal() fail when comparing unsanitary lists. (#10880) @nvdbaranec
Cleanup regex compiler fixed quantifiers source (#10843) @davidwendt
Refactor cudf::contains, renaming and switching parameters role (#10802) @ttnghia
Generic serialization of all column types (#10784) @wence-
Return per-file metadata from readers (#10782) @vuule
HostColumnVectorCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
Update groupby::hash to use new row operators for keys (#10770) @PointKernel
Update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
Rename CUDA_TRY macro to CUDF_CUDA_TRY, rename CHECK_CUDA macro to CUDF_CHECK_CUDA. (#10589) @bdice
Upgrade cudf to support pandas 1.4.x versions (#10584) @galipremsagar
Move binop methods from Frame to IndexedFrame and standardize the docstring (#10576) @vyasr
Add default= kwarg to .list.get() accessor method (#10547) @shwina
Remove deprecated decimal_cols_as_float in the ORC reader (#10515) @vuule
Support nvcomp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
Fix findall_record to return empty list for no matches (#10491) @davidwendt
Namespace/Docstring Fixes for Reduction (#10471) @isVoid
Additional refactoring of hash functions (#10462) @bdice
Fix default value of str.split expand parameter. (#10457) @bdice
Remove deprecated code. (#10450) @vyasr
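
Downstream C++ code that used the old macro names from #10589 needs a mechanical rename. The following is a minimal sketch of such a migration, not a tool shipped with cudf: the helper name migrate_macros and the sample call sites are hypothetical, and only the two old/new name pairs come from the entry above.

```python
import re

# Old macro name -> new macro name, as listed for #10589.
RENAMES = {
    "CUDA_TRY": "CUDF_CUDA_TRY",
    "CHECK_CUDA": "CUDF_CHECK_CUDA",
}

# \b does not match between "_" and a letter, so names that already carry a
# prefix (CUDF_CUDA_TRY, RMM_CUDA_TRY) are left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_macros(source: str) -> str:
    """Return `source` with the old macro names replaced by the new ones."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], source)


# Hypothetical call sites, used only to exercise the helper.
old = "CUDA_TRY(cudaStreamSynchronize(stream));\nCHECK_CUDA(stream);\n"
new = migrate_macros(old)
print(new)
```

Because prefixed names are skipped, running the helper twice over the same file should leave it unchanged.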

🐛 Bug Fixes

Fix single column MultiIndex issue in sort_index (#10957) @galipremsagar
Make SerializedTableHeader(numRows) public (#10949) @gerashegalov
Fix gcc_linux version pinning in dev environment (#10943) @galipremsagar
Fix an issue with reading raw string in cudf.read_json (#10924) @galipremsagar
Make cudf::test::expect_columns_equal() fail when comparing unsanitary lists. (#10880) @nvdbaranec
Fix segmented_reduce on empty column with non-empty offsets (#10876) @davidwendt
Fix dask-cudf groupby handling when grouping by all columns (#10866) @charlesbluca
Fix a bug in distinct: using nested nulls logic (#10848) @PointKernel
Fix constness / references in weak ordering operator() signatures. (#10846) @bdice
Suppress sizeof-array-div warnings in thrust found by gcc-11 (#10840) @robertmaynard
Add handling for string by-columns in dask-cudf groupby (#10830) @charlesbluca
Fix compile warning in search.cu (#10827) @davidwendt
Fix element access const correctness in hostdevice_vector (#10804) @vuule
Update cuco git tag (#10788) @PointKernel
HostColumnVectorCore#isNull should return true for out-of-range rows (#10779) @gerashegalov
Fixing deprecation warnings in test_orc.py (#10772) @hyperbolic2346
Enable writing to s3 storage in chunked parquet writer (#10769) @galipremsagar
Fix construction of nested structs with EMPTY child (#10761) @shwina
Fix replace error when regex has only zero match quantifiers (#10760) @davidwendt
Fix an issue with one_level_list schemas in parquet reader. (#10750) @nvdbaranec
Update mangle_dupe_cols behavior in csv reader to match pandas 1.4.0 behavior (#10749) @karthikeyann
Fix cupy function in notebook (#10737) @ajschmidt8
Fix fillna to retain columns when it is MultiIndex (#10729) @galipremsagar
Fix scatter for all-empty-string column case (#10724) @davidwendt
Retain series name in Series.apply (#10716) @brandon-b-miller
Correct build dir cudf-config dependency issues for static builds (#10704) @robertmaynard
Fix list of testing requirements in setup.py. (#10678) @bdice
Fix rounding to zero error in stod on very small float numbers (#10672) @davidwendt
cuco isn't a cudf dependency when we are built shared (#10662) @robertmaynard
Fix to_timestamps to support Z for %z format specifier (#10617) @davidwendt
Verify compression type in Parquet reader (#10610) @vuule
Fix struct row comparator's exception on empty structs (#10604) @sperlingxx
Fix strings strip() to accept only str Scalar for to_strip parameter (#10597) @davidwendt
Fix has_atomic_support check in can_use_hash_groupby() (#10588) @jbrennan333
Revert Thrust 1.16 to Thrust 1.15 (#10586) @bdice
Fix missing RMM_STATIC_CUDART define when compiling JNI with static CUDA runtime (#10585) @jlowe
Pin more CMake versions (#10570) @robertmaynard
Re-enable Build Metrics Report (#10562) @davidwendt
Remove statically linked CUDA runtime check in Java build (#10532) @jlowe
Fix temp data cleanup in test_text.py (#10524) @brandon-b-miller
Update pre-commit to run black 22.3.0 (#10523) @vyasr
Remove deprecated decimal_cols_as_float in the ORC reader (#10515) @vuule
Fix findall_record to return empty list for no matches (#10491) @davidwendt
Allow users to specify data types for a subset of columns in read_csv (#10484) @vuule
Fix default value of str.split expand parameter. (#10457) @bdice
Improve coverage of dask-cudf's groupby aggregation, add tests for dropna support (#10449) @charlesbluca
Allow string aggs for dask_cudf.CudfDataFrameGroupBy.aggregate (#10222) @charlesbluca
Fix in-place updates with loc or iloc when the LHS has more than one column (#9918) @skirui-source
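
For #10617, the intended meaning of a trailing Z under the %z specifier can be illustrated with the Python standard library, which treats Z as a zero UTC offset. This is only an analogy using datetime.strptime, not the libcudf to_timestamps code path, and it assumes the fixed cudf behavior follows the same convention; the sample timestamps are made up.

```python
from datetime import datetime, timedelta

fmt = "%Y-%m-%dT%H:%M:%S%z"

# "Z" and an explicit zero offset should describe the same instant.
with_z = datetime.strptime("2022-06-01T12:30:00Z", fmt)
with_offset = datetime.strptime("2022-06-01T12:30:00+0000", fmt)

print(with_z.utcoffset() == timedelta(0))  # Z parsed as a zero offset
print(with_z == with_offset)               # both strings name the same instant
```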

📖 Documentation

Clarify append deprecation notice. (#10930) @bdice
Use full name of GPUDirect Storage SDK in docs (#10904) @vuule
Update Dask + Pandas to Dask + cuDF path (#10897) @miguelusque
Add missing documentation in cudf/types.hpp (#10895) @karthikeyann
Add strong index iterator docs. (#10888) @bdice
Spell check fixes (#10865) @karthikeyann
Add missing documentation in scalar/ headers (#10861) @karthikeyann
Remove typo in ngram documentation (#10859) @miguelusque
Fix doxygen warnings (#10842) @karthikeyann
Add a library_design.md file documenting the core Python data structures and their relationship (#10817) @vyasr
Add NumPy to intersphinx references. (#10809) @bdice
Add a section to the docs that compares cuDF with Pandas (#10796) @shwina
Mention 2 cpp-reviewer requirement in pull request template (#10768) @davidwendt
Enable pydocstyle for all packages. (#10759) @bdice
Enable pydocstyle rules involving quotes (#10748) @vyasr
Revise 10 minutes notebook. (#10738) @bdice
Reorganize cuDF Python docs (#10691) @shwina
Fix sphinx/jupyter heading issue in UDF notebook (#10690) @brandon-b-miller
Migrated user guide notebooks to MyST-NB and added sphinx extension (#10685) @mmccarty
Add data generation to benchmark documentation (#10677) @karthikeyann
Fix some docs build warnings (#10674) @galipremsagar
Update UDF notebook in User Guide. (#10668) @bdice
Improve User Guide docs (#10663) @bdice
Fix some docstrings formatting (#10660) @galipremsagar
Remove implementation details from apply docstrings (#10651) @brandon-b-miller
Revise CONTRIBUTING.md (#10644) @bdice
Add missing APIs to documentation. (#10643) @bdice
Use cudf.read_json as documented API name. (#10640) @bdice
Fix docstring section headings. (#10639) @bdice
Document cudf.read_text and cudf.read_avro. (#10638) @bdice
Fix typo in docstring for json_reader_options (#10627) @dagardner-nv
Update guide to UDFs with notes about Series.applymap deprecation and related changes (#10607) @brandon-b-miller
Fix doxygen Modules page for cudf::lists::sequences (#10561) @davidwendt
Add Replace Backreferences section to Regex Features page (#10560) @davidwendt
Introduce deprecation policy to developer guide. (#10252) @vyasr

🚀 New Features

Enable Zstandard decompression only when all nvcomp integrations are enabled (#10944) @vuule
Handle nested types in cudf::concatenate_rows() (#10890) @nvdbaranec
Strong index types for equality comparator (#10883) @ttnghia
Add parameters to control page size in Parquet writer (#10882) @etseidl
Support for Zstandard decompression in ORC reader (#10873) @vuule
Use pre-built nvcomp 2.3 binaries by default (#10851) @robertmaynard
Support for Zstandard decompression in Parquet reader (#10847) @vuule
Add JNI support for apply_boolean_mask (#10812) @res-life
Segmented Min/Max for Fixed Point Types (#10794) @isVoid
Return per-file metadata from readers (#10782) @vuule
Segmented apply_boolean_mask for LIST columns (#10773) @mythrocks
Update groupby::hash to use new row operators for keys (#10770) @PointKernel
Support purging non-empty null elements from LIST/STRING columns (#10701) @mythrocks
Add detail::hash_join (#10695) @PointKernel
Persist string statistics data across multiple calls to orc chunked write (#10694) @hyperbolic2346
Add .list.astype() to cast list leaves to specified dtype (#10693) @shwina
JNI: Add generateListOffsets API (#10683) @sperlingxx
Support args in groupby apply (#10682) @brandon-b-miller
Enable segmented_gather in Java package (#10669) @sperlingxx
Add row hasher with nested column support (#10641) @devavret
Add support for numeric_only in DataFrame._reduce (#10629) @martinfalisse
First step toward statistics in ORC files with chunked writes (#10567) @hyperbolic2346
Add support for struct columns to the random table generator (#10566) @vuule
Enable passing a sequence for the index argument to .list.get() (#10564) @shwina
Add python bindings for cudf::list::index_of (#10549) @ChrisJar
Add default= kwarg to .list.get() accessor method (#10547) @shwina
Add cudf.DataFrame.applymap (#10542) @brandon-b-miller
Support nvcomp 2.3 if local, otherwise use nvcomp 2.2 (#10513) @robertmaynard
Add column field ID control in parquet writer (#10504) @PointKernel
Deprecate Series.applymap (#10497) @brandon-b-miller
Add option to drop cache in cuIO benchmarks (#10488) @vuule
Move benchmark input generation to device in reduction nvbench (#10486) @karthikeyann
Support Segmented Min/Max Reduction on String Type (#10447) @isVoid
List element Equality comparator (#10289) @devavret
Implement all methods of groupby rank aggregation in libcudf, python (#9569) @karthikeyann
Implement DataFrame.eval using libcudf ASTs (#8022) @vyasr
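
The two .list.get() entries (#10547 and #10564) combine into one lookup rule: one element is taken per row, the index may be a single integer or one integer per row, and an out-of-bounds position yields the default instead of an error. The sketch below models that rule on plain Python lists. The function list_get is a hypothetical stand-in, not the cudf accessor, and it assumes that None plays the role of <NA> when no default is given.

```python
from typing import Any, List, Sequence, Union


def list_get(rows: List[list],
             index: Union[int, Sequence[int]],
             default: Any = None) -> list:
    """Pure-Python model: pick one element per row, never raising IndexError."""
    if isinstance(index, int):
        # A scalar index applies to every row.
        indices = [index] * len(rows)
    else:
        # A sequence supplies one position per row.
        indices = list(index)
        if len(indices) != len(rows):
            raise ValueError("index sequence must have one entry per row")
    out = []
    for row, i in zip(rows, indices):
        if -len(row) <= i < len(row):
            out.append(row[i])
        else:
            out.append(default)  # None stands in for <NA> here
    return out


rows = [[1, 2], [3, 4, 5], [4, 5, 6]]
print(list_get(rows, 2))             # first row is too short
print(list_get(rows, 2, default=0))  # the gap is filled with the default
print(list_get(rows, [0, 1, 2]))     # one index per row
```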

🛠️ Improvements

Use conda compilers in env file (#10915) @galipremsagar
Remove C style artifacts in cuIO (#10886) @vuule
Rename sliced_child to get_sliced_child. (#10885) @bdice
Replace defaulted stream value for libcudf APIs that use NVCOMP (#10877) @jbrennan333
Add more unit tests for cudf::distinct for nested types with sliced input (#10860) @ttnghia
Changing list_view.cuh to list_view.hpp (#10854) @ttnghia
More error checking in from_dlpack (#10850) @wence-
Cleanup regex compiler fixed quantifiers source (#10843) @davidwendt
Adds the JNI call for Cuda.deviceSynchronize (#10839) @abellina
Add missing cuda-python dependency to cudf (#10833) @bdice
Change std::string parameters in cudf::strings APIs to std::string_view (#10832) @davidwendt
Split up search.cu to improve compile time (#10831) @davidwendt
Add tests for null scalar binaryops (#10828) @brandon-b-miller
Cleanup regex compile optimize functions (#10825) @davidwendt
Use ThreadedMotoServer instead of subprocess in spinning up s3 server (#10822) @galipremsagar
Import NA from missing rather than using cudf.NA everywhere (#10821) @brandon-b-miller
Refactor regex builtin character-class identifiers (#10814) @davidwendt
Change pattern parameter for regex APIs from std::string to std::string_view (#10810) @davidwendt
Make the JNI API to get list offsets as a view public. (#10807) @revans2
Add cudf JNI docker build github action (#10806) @pxLi
Removed mr parameter from inplace bitmask operations (#10805) @AtlantaPepsi
Refactor cudf::contains, renaming and switching parameters role (#10802) @ttnghia
Handle closed property in IntervalDtype.from_pandas (#10798) @wence-
Return weak orderings from device_row_comparator. (#10793) @rwlee
Rework Scalar imports (#10791) @brandon-b-miller
Enable ccache for cudfjni build in Docker (#10790) @gerashegalov
Generic serialization of all column types (#10784) @wence-
Simplify skiprows test in test_orc.py (#10783) @hyperbolic2346
Use column_views instead of column_device_views in binary operations. (#10780) @bdice
Add struct utility functions. (#10776) @bdice
Add multiple rows to subword tokenizer benchmark (#10767) @davidwendt
Refactor host decompression in ORC reader (#10764) @vuule
Flush output streams before creating a process to drop caches (#10762) @vuule
Refactor binaryop/compiled/util.cpp (#10756) @bdice
Use warp per string for long strings in cudf::strings::contains() (#10739) @davidwendt
Use generator expressions in any/all functions. (#10736) @bdice
Use canonical "magic methods" (replace x.__repr__() with repr(x)). (#10735) @bdice
Improve use of isinstance. (#10734) @bdice
Rename tests from multiIndex to multiindex. (#10732) @bdice
Two-table comparators with strong index types (#10730) @bdice
Replace std::make_pair with std::pair (C++17 CTAD) (#10727) @karthikeyann
Use structured bindings instead of std::tie (#10726) @karthikeyann
Missing f prefix on f-strings fix (#10721) @code-review-doctor
Add max_file_size parameter to chunked parquet dataset writer (#10718) @galipremsagar
Deprecate merge_sorted, change dask cudf usage to internal method (#10713) @isVoid
Prepare dask_cudf test_parquet.py for upcoming API changes (#10709) @rjzamora
Remove or simplify various utility functions (#10705) @vyasr
Allow building arrow with parquet and not python (#10702) @revans2
Partial cuIO GPU decompression refactor (#10699) @vuule
Cython API refactor: merge.pyx (#10698) @isVoid
Fix random string data length to become variable (#10697) @galipremsagar
Add bindings for index_of with column search key (#10696) @ChrisJar
Deprecate index merging (#10689) @vyasr
Remove cudf::strings::string namespace (#10684) @davidwendt
Standardize imports. (#10680) @bdice
Standardize usage of collections.abc. (#10679) @bdice
Cython API Refactor: transpose.pyx, sort.pyx (#10675) @isVoid
Add device_memory_resource parameter to create_string_vector_from_column (#10673) @davidwendt
Split up mixed-join kernels source files (#10671) @davidwendt
Use std::filesystem for temporary directory location and deletion (#10664) @vuule
Clean up benchmark includes (#10661) @karthikeyann
Use upstream clang-format pre-commit hook. (#10659) @bdice
Clean up C++ includes to use <> instead of "". (#10658) @bdice
Handle RuntimeError thrown by CUDA Python in validate_setup (#10653) @shwina
Rework JNI CMake to leverage rapids_find_package (#10649) @jlowe
Use conda to build python packages during GPU tests (#10648) @Ethyling
Deprecate various functions that don't need to be defined for Index. (#10647) @vyasr
Update pinning to allow newer CMake versions. (#10646) @vyasr
Bump hadoop-common from 3.1.4 to 3.2.3 in /java (#10645) @dependabot[bot]
Remove concurrent_unordered_multimap. (#10642) @bdice
Improve parquet dictionary encoding (#10635) @PointKernel
Improve cudf::cuda_error (#10630) @sperlingxx
Add support for null and non-numeric types in Series.diff and DataFrame.diff (#10625) @Matt711
Branch 22.06 merge 22.04 (#10624) @vyasr
Unpin dask & distributed for development (#10623) @galipremsagar
Slightly improve accuracy of stod in to_floats (#10622) @davidwendt
Allow libcudfjni to be built as a static library (#10619) @jlowe
Change stack-based regex state data to use global memory (#10600) @davidwendt
Resolve Forward merging of branch-22.04 into branch-22.06 (#10598) @galipremsagar
KvikIO as an alternative GDS backend (#10593) @madsbk
Rename CUDA_TRY macro to CUDF_CUDA_TRY, rename CHECK_CUDA macro to CUDF_CHECK_CUDA. (#10589) @bdice
Upgrade cudf to support pandas 1.4.x versions (#10584) @galipremsagar
Refactor binary ops for timedelta and datetime columns (#10581) @vyasr
Refactor cudf::strings::count_re API to use count_matches utility (#10580) @davidwendt
Update Programming Language :: Python Versions to 3.8 & 3.9 (#10579) @madsbk
Automate Java cudf jar build with statically linked dependencies (#10578) @gerashegalov
Add patch for thrust-cub 1.16 to fix sort compile times (#10577) @davidwendt
Move binop methods from Frame to IndexedFrame and standardize the docstring (#10576) @vyasr
Cleanup libcudf strings regex classes (#10573) @davidwendt
Simplify preprocessing of arguments for DataFrame binops (#10563) @vyasr
Reduce kernel calls to build strings findall results (#10559) @davidwendt
Forward-merge branch-22.04 to branch-22.06 (#10557) @bdice
Update strings contains benchmark to measure varying match rates (#10555) @davidwendt
JNI: throw CUDA errors more specifically (#10551) @sperlingxx
Enable building static libs (#10545) @trxcllnt
Remove pip requirements files. (#10543) @bdice
Remove Click pinnings that are unnecessary after upgrading black. (#10541) @vyasr
Refactor memory_usage to improve performance (#10537) @galipremsagar
Adjust the valid range of group index for replace_with_backrefs (#10530) @sperlingxx
Add accidentally removed comment. (#10526) @vyasr
Update conda environment. (#10525) @vyasr
Remove ColumnBase.getitem (#10516) @vyasr
Optimize left_semi_join by materializing the gather mask (#10511) @cheinger
Define proper binary operation APIs for columns (#10509) @vyasr
Upgrade arrow-cpp & pyarrow to 7.0.0 (#10503) @galipremsagar
Update to Thrust 1.16 (#10489) @bdice
Namespace/Docstring Fixes for Reduction (#10471) @isVoid
Update cudfjni 22.06.0-SNAPSHOT (#10467) @pxLi
Use Lists of Columns for Various Files (#10463) @isVoid
Additional refactoring of hash functions (#10462) @bdice
Fix Series.str.findall behavior for expand=False. (#10459) @bdice
Remove deprecated code. (#10450) @vyasr
Update cmake-format version. (#10440) @vyasr
Consolidate C++ conda recipes and add libcudf-tests package (#10326) @ajschmidt8
Use conda compilers (#10275) @Ethyling
Add row bitmask as a detail::hash_join member (#10248) @PointKernel
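
Several of the Python cleanup entries above (#10736, #10735, #10734, #10721, #10679) name general idioms. The snippet below shows what those idioms look like in isolation. It is illustrative only: the "instead of" forms in the comments are assumptions about the kind of code being replaced, not excerpts from the actual diffs.

```python
from collections import abc

values = [1, 2, 3]
name = "values"

# Generator expression in any/all: short-circuits without building a list first.
has_negative = any(v < 0 for v in values)  # instead of any([v < 0 for v in values])

# Built-in function rather than a direct dunder call.
text = repr(values)  # instead of values.__repr__()

# A single isinstance call with a tuple of types.
is_number = isinstance(values[0], (int, float))  # instead of two isinstance checks joined by "or"

# The f prefix is required for the placeholder to be filled in.
message = f"{name} has {len(values)} items"  # without the f, the braces stay literal

# Abstract base classes come from collections.abc.
is_sequence = isinstance(values, abc.Sequence)

print(has_negative, text, is_number, message, is_sequence)
```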
