Change log

Generated on 2025-02-17

Release 25.02

Features

#11648 [FEA] it would be nice if we could support org.apache.spark.sql.catalyst.expressions.Bin
#11891 [FEA] Support Spark 3.5.4 release
#11928 [FEA] make maxCpuBatchSize in GpuPartitioning configurable
#10505 [FOLLOW UP] Support row_number() filters for GpuWindowGroupLimitExec
#11853 [FEA] Ability to dump tables on a write
#11804 [FEA] Support TruncDate expression
#11674 [FEA] HiveHash supports nested types

Performance

#11342 [FEA] Put file writes in a background thread
#11729 [FEA] optimize the multi-contains generated by rlike
#11860 [FEA] kernel for date_trunc and trunc that has a scalar format
#11812 [FEA] Support escape characters in search list when rewrite regexp_replace to string replace

Bugs Fixed

#12091 [BUG] An assertion error in the sized hash join
#12096 [BUG] CI_PART1 for DBR 14.3 hangs in the nightly pre-release pipeline
#12076 [BUG] ExtraPlugins might be loaded duplicated
#11433 [BUG] Spark UT framework: SPARK-34212 Parquet should read decimals correctly
#12038 [BUG] spark321 failed core dump in nightly
#12046 [BUG] orc_test fail non-UTC cases with Part of the plan is not columnar class org.apache.spark.sql.execution.FileSourceScanExec
#12039 [BUG] HostAllocSuite failed "Maximum pool size exceeded" in nightly UT tests
#12036 [BUG] The check in assertIsOnTheGpu method to test if a plan is on the GPU is not accurate
#11989 [BUG] ParquetCachedBatchSerializer does not grab the GPU semaphore and does not have retry blocks
#11651 [BUG] Parse regular expressions using JDK to make error behavior more consistent between CPU and GPU
#11628 [BUG] Spark UT framework: select one deep nested complex field after join, IOException parsing parquet
#11629 [BUG] Spark UT framework: select one deep nested complex field after outer join, IOException parsing parquet
#11620 [BUG] Spark UT framework: "select a single complex field and partition column" causes java.lang.IndexOutOfBoundsException
#11621 [BUG] Spark UT framework: partial schema intersection - select missing subfield causes java.lang.IndexOutOfBoundsException
#11619 [BUG] Spark UT framework: "select a single complex field" causes java.lang.IndexOutOfBoundsException
#11975 [FOLLOWUP] We should have a separate version definition for the rapids-4-spark-hybrid dependency
#11976 [BUG] scala 2.13 rapids_integration test failed
#11971 [BUG] scala213 nightly build failed rapids-4-spark-tests_2.13 of spark400
#11903 [BUG] Unexpected large output batches due to implementation defects
#11914 [BUG] Nightly CI does not upload sources and Javadoc JARs as the release script does
#11896 [BUG] [BUILD] CI passes without checking for operatorsScore.csv, supportedExprs.csv update
#11895 [BUG] BasePythonRunner has a new parameter metrics in Spark 4.0
#11107 [BUG] Rework RapidsShuffleManager initialization for Apache Spark 4.0.0
#11897 [BUG] JsonScanRetrySuite is failing in the CI.
#11885 [BUG] data corruption with spill framework changes
#11762 [BUG] Non-nullable bools in a nullable struct fails
#11526 Fix Arithmetic tests on Databricks 14.3
#11866 [BUG] The CHANGELOG is generated based on the project's roadmap rather than the target branch.
#11749 [BUG] Include Databricks 14.3 shim into the dist jar
#11822 [BUG] [Spark 4] Type mismatch Exceptions from DFUDFShims.scala with Spark-4.0.0 expressions.Expression
#11760 [BUG] isTimestamp leaks a Scalar
#11796 [BUG] populate-daily-cache action masks errors
#10901 from_json throws exception when the json's structure only partially matches the provided schema
#11736 [BUG] Orc writes don't fully support Booleans with nulls

PRs

#12129 Update dependency version JNI, private, hybrid to 25.02.0 [skip ci]
#12102 [DOC] update the download page for 2502 release [skip ci]
#12112 HybridParquetScan: Fix velox runtime error in hybrid scan when filter timestamp
#12092 Fix an assertion error in the sized hash join
#12114 Fix HybridParquetScan over select(1)
#12109 revert ucx 1.18 upgrade
#12103 Revert "Enable event log for qualification & profiling tools testing …
#12058 upgrade jucx to 1.18
#12077 Fix the issue of ExtraPlugins loading multiple times
#12080 Quick fix for hybrid tests without git information.
#12068 Do not build Spark-4.0.0-SNAPSHOT [skip ci]
#12064 Run mvn with the project's pom.xml in hybrid_execution.sh
#12060 Relax decimal metadata checks for mismatched precision/scale
#12054 Update the version of the rapids-hybrid-execution dependency.
#11970 Explicitly set Delta table props to accommodate for different defaults
#12044 Set CI=true for complete failure reason in summary
#12050 Fixed FileSourceScanExec and BatchScanExec inadvertently falling to the CPU in non-utc orc tests
#12000 HybridParquetScan: Refine filter push down to avoid double evaluation
#12037 Removed the assumption if a plan is Columnar it probably is on the GPU
#11991 Grab the GPU Semaphore when reading cached batch data with the GPU
#11880 Perform handle spill IO outside of locked section in SpillFramework
#11997 Configure 14.3 support at runtime
#11977 Use bounce buffer pools in the Spill Framework
#11912 Ensure Java Compatibility Check for Regex Patterns
#11984 Include the size information when printing a SCB
#11889 Change order of initialization so pinned pool is available for spill framework buffers
#11956 Enable tests in RapidsParquetSchemaPruningSuite
#11981 Protect the batch read by a retry block in agg
#11967 Add support for org.apache.spark.sql.catalyst.expressions.Bin
#11982 Use common add-to-project action [skip ci]
#11978 Try to fix Scala 2.13 nightly failure: can not find version-def.sh
#11973 Minor change: Make Hybrid version a separate config like private repo
#11969 Support raise_error() on 14.3, Spark 4.
#11972 Update MockTaskContext to support new functions added in Spark-4.0
#11906 Enable Hybrid test cases in premerge/nightly CIs
#11720 Introduce hybrid (CPU) scan for Parquet read
#11911 Avoid concatenating multiple host buffers when reading Parquet
#11960 Remove jlowe as committer since he retired
#11958 Update to use vulnerability-scan runner [skip ci]
#11955 Add Spark 3.5.4 shim
#11959 Remove inactive user from github workflow [skip ci]
#11952 Fix auto merge conflict 11948 [skip ci]
#11908 Fix two potential OOM issues in GPU aggregate.
#11936 Add throttle time metrics for async write
#11929 make maxCpuBatchSize in GpuPartitioning configurable
#11939 [DOC] update release note to add spark 353 support [skip ci]
#11920 Remove Alluxio support
#11938 Update codeowners file to use team [skip ci]
#11915 Deploy the sources and Javadoc JARs in the nightly CICD [skip ci]
#11917 Fix issue with CustomerShuffleReaderExec metadata copy
#11910 fix bug: enable if_modified_files check for all shims in github actions [skip ci]
#11909 Update copyright year in NOTICE [skip ci]
#11907 Fix generated doc for xxhash64 for Spark 400
#11905 Fix the build error for Spark 400
#11904 Eagerly initialize RapidsShuffleManager for SPARK-45762
#11865 Async write support for ORC
#11816 address some comments for 11792
#11789 Improve the retry support for nondeterministic expressions
#11898 Add missing json reader options for JsonScanRetrySuite
#11859 Xxhash64 supports nested types
#11890 Update operatorsScore,supportedExprs for TruncDate, TruncTimestamp
#11886 Support group-limit optimization for ROW_NUMBER
#11887 Make sure that the chunked packer bounce buffer is released after the synchronize
#11894 Fix bug: add timeout for cache deps steps [skip ci]
#11810 Use faster multi-contains in rlike regex rewrite
#11882 Add metrics GpuPartitioning.CopyToHostTime
#11864 Add support for dumping write data to try and reproduce error cases
#11781 Fix non-nullable under nullable struct write
#11877 Fix auto merge conflict 11873 [skip ci]
#11833 Support trunc and date_trunc SQL function
#11660 Add HiveHash support for nested types
#11855 Add integration test for parquet async writer
#11747 Spill framework refactor for better performance and extensibility
#11870 Workaround: Exclude cudf_log.txt in RAT check
#11867 Generate the CHANGELOG based on the PR's target branch [skip ci]
#11821 add a few more stage level metrics
#11856 Document Hive text write serialization format checks
#11805 Enable some integration tests for from_json
#11840 Support running Databricks CI_PART2 integration tests with JARs built by CI_PART1
#11847 Some small improvements
#11811 Fix bug: populate cache deps [skip ci]
#11817 Optimize Databricks Jenkins scripts [skip ci]
#11829 Some minor improvements identified during benchmark
#11827 Deal with Spark changes for column<->expression conversions
#11826 Balance the pre-merge CI job's time for the ci_1 and ci_2 tests
#11784 Add support for kudo write metrics
#11783 Fix the task count check in TrafficController
#11813 Support some escape chars when rewriting regexp_replace to stringReplace
#11819 Add the 'test_type' parameter for Databricks script
#11786 Enable license header check
#11791 Incorporate checksum of internal dependencies in the GH cache key [skip ci]
#11788 Support running Databricks CI_PART2 integration tests with JARs built by CI_PART1
#11778 Remove unnecessary toBeReturned field from serialized batch iterators
#11785 Update advanced configs introduced by private repo [skip ci]
#11772 Update rapids JNI and private dependency to 25.02.0-SNAPSHOT
#11756 remove excluded release shim and TODO

Release 24.12

Features

#11630 [FEA] enable from_json and json scan by default
#11709 [FEA] Add support for MonthsBetween
#11666 [FEA] support task limit profiling for specified stages
#11662 [FEA] Support Apache Spark 3.4.4
#11657 [FEA] Support format 'yyyyMMdd HH:mm:ss' for legacy mode
#11419 [FEA] Support Spark 3.5.3 release
#11505 [FEA] Support yyyymmdd format for GetTimestamp for LEGACY mode.

Performance

#8391 [FEA] Do a hash based re-partition instead of a sort based fallback for hash aggregate
#11560 [FEA] Improve GpuJsonToStructs performance
#11458 [FEA] enable prune_columns for from_json

Bugs Fixed

#11842 [BUG] udf-examples-native case failed core dump
#11718 [BUG] update date/time APIs in CUDF java to avoid deprecated functions
#10907 from_json function parses a column containing an empty array, throws an exception.
#11807 [BUG] mismatched cpu and gpu result in test_lead_lag_for_structs_with_arrays intermittently
#11793 [BUG] "Time in Heuristic" should not include previous operator's compute time
#11798 [BUG] mismatch CPU and GPU result in test_months_between_first_day[DATAGEN_SEED=1733006411, TZ=Africa/Casablanca]
#11790 [BUG] test_hash_* failed "java.util.NoSuchElementException: head of empty list" or "Too many times of repartition, may hit a bug?"
#11643 [BUG] Support AQE with Broadcast Hash Join and DPP on Databricks 14.3
#10910 from_json, when input = empty object, rapids throws an exception.
#10891 Parsing a column containing invalid json into StructureType with schema throws an Exception.
#11741 [BUG] Fix spark400 build due to writeWithV1 return value change
#11533 Fix JSON Matrix tests on Databricks 14.3
#11722 [BUG] Spark 4.0.0 has moved NullIntolerant and builds are breaking because they are unable to find it.
#11726 [BUG] Databricks 14.3 nightly deploy fails due to incorrect DB_SHIM_NAME
#11293 [BUG] A user query with from_json failed with "JSON Parser encountered an invalid format at location"
#9592 [BUG][JSON] from_json to Map type should produce null for invalid entries
#11715 [BUG] parquet_testing_test.py failed on "AssertionError: GPU and CPU boolean values are different"
#11716 [BUG] delta_lake_write_test.py failed on "AssertionError: GPU and CPU boolean values are different"
#11684 [BUG] 24.12 Precommit fails with wrong number of arguments in GpuDataSource
#11168 [BUG] reserve allocation should be displayed when erroring due to lack of memory on startup
#7585 [BUG] [Regexp] Line anchor '$' incorrect matching of unicode line terminators
#11622 [BUG] GPU Parquet scan filter pushdown fails with timestamp/INT96 column
#11646 [BUG] NullPointerException in GpuRand
#10498 [BUG] Unit tests failed: [INTERVAL_ARITHMETIC_OVERFLOW] integer overflow. Use 'try_add' to tolerate overflow and return NULL instead
#11659 [BUG] parse_url throws exception if partToExtract is invalid while Spark returns null
#10894 Parsing a column containing a nested structure to json throws an exception
#10895 Converting a column containing a map into json throws an exception
#10896 Converting a column containing an array into json throws an exception
#10915 to_json when converts an array will throw an exception:
#10916 to_json function doesn't support map[string, struct] to json conversion.
#10919 to_json converting map[string, integer] to json, throws an exception
#10920 to_json converting an array with maps throws an exception.
#10921 to_json - array with single map
#10923 [BUG] Spark UT framework: to_json function to convert the array with a single empty row to a JSON string throws an exception.
#10924 [BUG] Spark UT framework: to_json when converts an empty array into json throws an exception.
#11024 Fix tests failures in parquet_write_test.py
#11174 Opcode Suite fails for Scala 2.13.8+
#10483 [BUG] JsonToStructs fails to parse all empty dicts and invalid lines
#10489 [BUG] from_json does not support input with \n in it.
#10347 [BUG] Failures in Integration Tests on Dataproc Serverless
#11021 Fix tests failures in orc_cast_test.py
#11609 [BUG] test_hash_repartition_long_overflow_ansi_exception failed on 341DB
#11600 [BUG] regex_test failed mismatched cpu and gpu values in UT and IT
#11611 [BUG] Spark 4.0 build failure - value cannotSaveIntervalIntoExternalStorageError is not a member of object org.apache.spark.sql.errors.QueryCompilationErrors
#10922 from_json cannot support line separator in the input string.
#11009 Fix tests failures in cast_test.py
#11572 [BUG] MultiFileReaderThreadPool may flood the console with log messages

PRs

#11950 Update latest changelog [skip ci]
#11947 Update version to 24.12.1-SNAPSHOT [skip ci]
#11943 Update rapids JNI dependency to 24.12.1
#11944 Update download page for 24.12.1 hot fix release [skip ci]
#11876 Update latest changelog [skip ci]
#11874 Remove 350db143 shim's build [skip ci]
#11851 Update latest changelog [skip ci]
#11849 Update rapids JNI and private dependency to 24.12.0
#11841 [DOC] update doc for 24.12 release [skip ci]
#11857 Increase the pre-merge CI timeout to 6 hours
#11845 Fix leak in isTimeStamp
#11823 Fix for LEAD/LAG window function test failures.
#11832 Fix leak in GpuBroadcastNestedLoopJoinExecBase
#11763 Orc writes don't fully support Booleans with nulls
#11794 exclude previous operator's time out of firstBatchHeuristic
#11802 Fall back to CPU for non-UTC months_between
#11792 [BUG] Fix issue 11790
#11768 Fix dpp_test.py failures on 14.3
#11752 Ability to decompress snappy and zstd Parquet files via CPU
#11777 Append knoguchi22 to blossom-ci whitelist [skip ci]
#11712 repartition-based fallback for hash aggregate v3
#11771 Fix query hang when using rapids multithread shuffle manager with kudo
#11759 Avoid using StringBuffer in single-threaded methods.
#11766 Fix Kudo batch serializer to only read header in hasNext
#11730 Add support for asynchronous writing for parquet
#11750 Fix aqe_test failures on 14.3.
#11753 Enable JSON Scan and from_json by default
#11733 Print out the current attempt object when OOM inside a retry block
#11618 Execute from_json with struct schema using JSONUtils.fromJSONToStructs
#11725 host watermark metric
#11746 Remove batch size bytes limits
#11723 Add NVIDIA Copyright
#11721 Add a few more JSON tests for MAP<STRING,STRING>
#11744 Do not package the Databricks 14.3 shim into the dist jar [skip ci]
#11724 Integrate with kudo
#11739 Update to Spark 4.0 changing signature of SupportsV1Write.writeWithV1
#11737 Add in support for months_between
#11700 Fix leak with RapidsHostColumnBuilder in GpuUserDefinedFunction
#11727 Widen type promotion for decimals with larger scale in Parquet Read
#11719 Skip from_json overflow tests for 14.3
#11708 Support profiling for specific stages on a limited number of tasks
#11731 Add NullIntolerantShim to adapt to Spark 4.0 removing NullIntolerant
#11413 Support multi string contains
#11728 Change Databricks 14.3 shim name to spark350db143 [skip ci]
#11702 Improve JSON scan and from_json
#11635 Added Shims for adding Databricks 14.3 Support
#11714 Let AWS Databricks automatically choose an Availability Zone
#11703 Simplify $ transpiling and fix newline character bug
#11707 impalaFile cannot be found by UT framework.
#11697 Make delta-lake shim dependencies parametrizable
#11710 Add shim version 344 to LogicalPlanShims.scala
#11706 Add retry support in sub hash join
#11673 Fix Parquet Writer tests on 14.3
#11669 Fix string_test for 14.3
#11692 Add Spark 3.4.4 Shim
#11695 Fix spark400 build due to LogicalRelation signature changes
#11689 Update the Maven repository to download Spark JAR files [skip ci]
#11670 Fix misc_expr_test for 14.3
#11652 Fix skipping fixed_length_char ORC tests on > 13.3
#11644 Skip AQE-join-DPP tests for 14.3
#11667 Preparation for the coming Kudo support
#11685 Exclude shimplify-generated files from scalastyle
#11282 Reserve allocation should be displayed when erroring due to lack of memory on startup
#11671 Use the new host memory allocation API
#11682 Fix auto merge conflict 11679 [skip ci]
#11663 Simplify Transpilation of $ with Extended Line Separator Support in cuDF Regex
#11672 Fix race condition with Parquet filter pushdown modifying shared hadoop Configuration
#11596 Add a new NVTX range for task GPU ownership
#11664 Fix orc_write_test.py for 14.3
#11656 [DOC] update the supported OS in download page [skip ci]
#11665 Generate classes identical up to the shim package name
#11647 Fix a NPE issue in GpuRand
#11658 Support format 'yyyyMMdd HH:mm:ss' for legacy mode
#11661 Support invalid partToExtract for parse_url
#11520 UT adjust override checkScanSchemata & enabling ut of exclude_by_suffix fea.
#11634 Put DF_UDF plugin code into the main uber jar.
#11522 UT adjust test SPARK-26677: negated null-safe equality comparison
#11521 Datetime rebasing issue fixed
#11642 Update to_json to be more generic and fix some bugs
#11615 Spark 4 parquet_writer_test.py fixes
#11623 Fix collection_ops_test for 14.3
#11553 Fix udf-compiler scala2.13 internal return statements
#11640 Disable date/timestamp types by default when parsing JSON
#11570 Add support for Spark 3.5.3
#11591 Spark UT framework: Read Parquet file generated by parquet-thrift Rapids, UT case adjust.
#11631 Update JSON tests based on a closed/fixed issues
#11617 Quick fix for the build script failure of Scala 2.13 jars [skip ci]
#11614 Ensure repartition overflow test always overflows
#11612 Revert "Disable regex tests to unblock CI (#11606)"
#11597 install_deps changes for Databricks 14.3
#11608 Use mvn -f scala2.13/ in the build scripts to build the 2.13 jars
#11610 Change DataSource calendar interval error to fix spark400 build
#11549 Adopt JSONUtils.concatenateJsonStrings for concatenating JSON strings
#11595 Remove an unused config shuffle.spillThreads
#11606 Disable regex tests to unblock CI
#11605 Fix auto merge conflict 11604 [skip ci]
#11587 avoid long tail tasks due to PrioritySemaphore, remaining part
#11574 avoid long tail tasks due to PrioritySemaphore
#11559 [Spark 4.0] Address test failures in cast_test.py
#11579 Fix merge conflict with branch-24.10
#11571 Log reconfigure multi-file thread pool only once
#11564 Disk spill metric
#11561 Add in a basic plugin for dataframe UDF support in Apache Spark
#11563 Fix the latest merge conflict in integration tests
#11542 Update rapids JNI and private dependency to 24.12.0-SNAPSHOT [skip ci]
#11493 Support legacy mode for yyyymmdd format

Older Releases

Changelog of older releases can be found at docs/archives