Change log

Generated on 2025-02-17

Release 25.02

Features


#11648	[FEA] it would be nice if we could support org.apache.spark.sql.catalyst.expressions.Bin
#11891	[FEA] Support Spark 3.5.4 release
#11928	[FEA] make maxCpuBatchSize in GpuPartitioning configurable
#10505	[FOLLOW UP] Support `row_number()` filters for `GpuWindowGroupLimitExec`
#11853	[FEA] Ability to dump tables on a write
#11804	[FEA] Support TruncDate expression
#11674	[FEA] HiveHash supports nested types

Performance


#11342	[FEA] Put file writes in a background thread
#11729	[FEA] optimize the multi-contains generated by rlike
#11860	[FEA] kernel for date_trunc and trunc that has a scalar format
#11812	[FEA] Support escape characters in search list when rewrite `regexp_replace` to string replace

Bugs Fixed


#12091	[BUG] An assertion error in the sized hash join
#12096	[BUG] CI_PART1 for DBR 14.3 hangs in the nightly pre-release pipeline
#12076	[BUG] ExtraPlugins might be loaded duplicated
#11433	[BUG] Spark UT framework: SPARK-34212 Parquet should read decimals correctly
#12038	[BUG] spark321 failed core dump in nightly
#12046	[BUG] orc_test fail non-UTC cases with Part of the plan is not columnar class org.apache.spark.sql.execution.FileSourceScanExec
#12039	[BUG] HostAllocSuite failed "Maximum pool size exceeded" in nightly UT tests
#12036	[BUG] The check in assertIsOnTheGpu method to test if a plan is on the GPU is not accurate
#11989	[BUG] ParquetCachedBatchSerializer does not grab the GPU semaphore and does not have retry blocks
#11651	[BUG] Parse regular expressions using JDK to make error behavior more consistent between CPU and GPU
#11628	[BUG] Spark UT framework: select one deep nested complex field after join, IOException parsing parquet
#11629	[BUG] Spark UT framework: select one deep nested complex field after outer join, IOException parsing parquet
#11620	[BUG] Spark UT framework: "select a single complex field and partition column" causes java.lang.IndexOutOfBoundsException
#11621	[BUG] Spark UT framework: partial schema intersection - select missing subfield causes java.lang.IndexOutOfBoundsException
#11619	[BUG] Spark UT framework: "select a single complex field" causes java.lang.IndexOutOfBoundsException
#11975	[FOLLOWUP] We should have a separate version definition for the rapids-4-spark-hybrid dependency
#11976	[BUG] scala 2.13 rapids_integration test failed
#11971	[BUG] scala213 nightly build failed rapids-4-spark-tests_2.13 of spark400
#11903	[BUG] Unexpected large output batches due to implementation defects
#11914	[BUG] Nightly CI does not upload sources and Javadoc JARs as the release script does
#11896	[BUG] [BUILD] CI passes without checking for `operatorsScore.csv`, `supportedExprs.csv` update
#11895	[BUG] BasePythonRunner has a new parameter metrics in Spark 4.0
#11107	[BUG] Rework RapidsShuffleManager initialization for Apache Spark 4.0.0
#11897	[BUG] JsonScanRetrySuite is failing in the CI.
#11885	[BUG] data corruption with spill framework changes
#11762	[BUG] Non-nullable bools in a nullable struct fails
#11526	Fix Arithmetic tests on Databricks 14.3
#11866	[BUG] The CHANGELOG is generated based on the project's roadmap rather than the target branch.
#11749	[BUG] Include Databricks 14.3 shim into the dist jar
#11822	[BUG] [Spark 4] Type mismatch Exceptions from DFUDFShims.scala with Spark-4.0.0 expressions.Expression
#11760	[BUG] isTimestamp leaks a Scalar
#11796	[BUG] populate-daily-cache action masks errors
#10901	from_json throws exception when the json's structure only partially matches the provided schema
#11736	[BUG] Orc writes don't fully support Booleans with nulls

PRs


#12129	Update dependency version JNI, private, hybrid to 25.02.0 [skip ci]
#12102	[DOC] update the download page for 2502 release [skip ci]
#12112	HybridParquetScan: Fix velox runtime error in hybrid scan when filter timestamp
#12092	Fix an assertion error in the sized hash join
#12114	Fix HybridParquetScan over select(1)
#12109	revert ucx 1.18 upgrade
#12103	Revert "Enable event log for qualification & profiling tools testing …
#12058	upgrade jucx to 1.18
#12077	Fix the issue of ExtraPlugins loading multiple times
#12080	Quick fix for hybrid tests without git information.
#12068	Do not build Spark-4.0.0-SNAPSHOT [skip ci]
#12064	Run mvn with the project's pom.xml in hybrid_execution.sh
#12060	Relax decimal metadata checks for mismatched precision/scale
#12054	Update the version of the rapids-hybrid-execution dependency.
#11970	Explicitly set Delta table props to accommodate for different defaults
#12044	Set CI=true for complete failure reason in summary
#12050	Fixed `FileSourceScanExec` and `BatchScanExec` inadvertently falling to the CPU in non-utc orc tests
#12000	HybridParquetScan: Refine filter push down to avoid double evaluation
#12037	Removed the assumption if a plan is Columnar it probably is on the GPU
#11991	Grab the GPU Semaphore when reading cached batch data with the GPU
#11880	Perform handle spill IO outside of locked section in SpillFramework
#11997	Configure 14.3 support at runtime
#11977	Use bounce buffer pools in the Spill Framework
#11912	Ensure Java Compatibility Check for Regex Patterns
#11984	Include the size information when printing a SCB
#11889	Change order of initialization so pinned pool is available for spill framework buffers
#11956	Enable tests in RapidsParquetSchemaPruningSuite
#11981	Protect the batch read by a retry block in agg
#11967	Add support for `org.apache.spark.sql.catalyst.expressions.Bin`
#11982	Use common add-to-project action [skip ci]
#11978	Try to fix Scala 2.13 nightly failure: can not find version-def.sh
#11973	Minor change: Make Hybrid version a separate config like private repo
#11969	Support `raise_error()` on 14.3, Spark 4.
#11972	Update MockTaskContext to support new functions added in Spark-4.0
#11906	Enable Hybrid test cases in premerge/nightly CIs
#11720	Introduce hybrid (CPU) scan for Parquet read
#11911	Avoid concatenating multiple host buffers when reading Parquet
#11960	Remove jlowe as committer since he retired
#11958	Update to use vulnerability-scan runner [skip ci]
#11955	Add Spark 3.5.4 shim
#11959	Remove inactive user from github workflow [skip ci]
#11952	Fix auto merge conflict 11948 [skip ci]
#11908	Fix two potential OOM issues in GPU aggregate.
#11936	Add throttle time metrics for async write
#11929	make maxCpuBatchSize in GpuPartitioning configurable
#11939	[DOC] update release note to add spark 353 support [skip ci]
#11920	Remove Alluxio support
#11938	Update codeowners file to use team [skip ci]
#11915	Deploy the sources and Javadoc JARs in the nightly CICD [skip ci]
#11917	Fix issue with CustomerShuffleReaderExec metadata copy
#11910	fix bug: enable if_modified_files check for all shims in github actions [skip ci]
#11909	Update copyright year in NOTICE [skip ci]
#11907	Fix generated doc for xxhash64 for Spark 400
#11905	Fix the build error for Spark 400
#11904	Eagerly initialize RapidsShuffleManager for SPARK-45762
#11865	Async write support for ORC
#11816	address some comments for 11792
#11789	Improve the retry support for nondeterministic expressions
#11898	Add missing json reader options for JsonScanRetrySuite
#11859	Xxhash64 supports nested types
#11890	Update operatorsScore,supportedExprs for TruncDate, TruncTimestamp
#11886	Support group-limit optimization for `ROW_NUMBER`
#11887	Make sure that the chunked packer bounce buffer is released after the synchronize
#11894	Fix bug: add timeout for cache deps steps [skip ci]
#11810	Use faster multi-contains in `rlike` regex rewrite
#11882	Add metrics GpuPartitioning.CopyToHostTime
#11864	Add support for dumping write data to try and reproduce error cases
#11781	Fix non-nullable under nullable struct write
#11877	Fix auto merge conflict 11873 [skip ci]
#11833	Support `trunc` and `date_trunc` SQL function
#11660	Add `HiveHash` support for nested types
#11855	Add integration test for parquet async writer
#11747	Spill framework refactor for better performance and extensibility
#11870	Workaround: Exclude cudf_log.txt in RAT check
#11867	Generate the CHANGELOG based on the PR's target branch [skip ci]
#11821	add a few more stage level metrics
#11856	Document Hive text write serialization format checks
#11805	Enable some integration tests for `from_json`
#11840	Support running Databricks CI_PART2 integration tests with JARs built by CI_PART1
#11847	Some small improvements
#11811	Fix bug: populate cache deps [skip ci]
#11817	Optimize Databricks Jenkins scripts [skip ci]
#11829	Some minor improvements identified during benchmark
#11827	Deal with Spark changes for column<->expression conversions
#11826	Balance the pre-merge CI job's time for the ci_1 and ci_2 tests
#11784	Add support for kudo write metrics
#11783	Fix the task count check in TrafficController
#11813	Support some escape chars when rewriting regexp_replace to stringReplace
#11819	Add the 'test_type' parameter for Databricks script
#11786	Enable license header check
#11791	Incorporate checksum of internal dependencies in the GH cache key [skip ci]
#11788	Support running Databricks CI_PART2 integration tests with JARs built by CI_PART1
#11778	Remove unnecessary toBeReturned field from serialized batch iterators
#11785	Update advanced configs introduced by private repo [skip ci]
#11772	Update rapids JNI and private dependency to 25.02.0-SNAPSHOT
#11756	remove excluded release shim and TODO

Release 24.12

Features


#11630	[FEA] enable from_json and json scan by default
#11709	[FEA] Add support for `MonthsBetween`
#11666	[FEA] support task limit profiling for specified stages
#11662	[FEA] Support Apache Spark 3.4.4
#11657	[FEA] Support format 'yyyyMMdd HH:mm:ss' for legacy mode
#11419	[FEA] Support Spark 3.5.3 release
#11505	[FEA] Support yyyymmdd format for GetTimestamp for LEGACY mode.

Performance


#8391	[FEA] Do a hash based re-partition instead of a sort based fallback for hash aggregate
#11560	[FEA] Improve `GpuJsonToStructs` performance
#11458	[FEA] enable prune_columns for from_json

Bugs Fixed


#11842	[BUG] udf-examples-native case failed core dump
#11718	[BUG] update date/time APIs in CUDF java to avoid deprecated functions
#10907	from_json function parses a column containing an empty array, throws an exception.
#11807	[BUG] mismatched cpu and gpu result in test_lead_lag_for_structs_with_arrays intermittently
#11793	[BUG] "Time in Heuristic" should not include previous operator's compute time
#11798	[BUG] mismatch CPU and GPU result in test_months_between_first_day[DATAGEN_SEED=1733006411, TZ=Africa/Casablanca]
#11790	[BUG] test_hash_* failed "java.util.NoSuchElementException: head of empty list" or "Too many times of repartition, may hit a bug?"
#11643	[BUG] Support AQE with Broadcast Hash Join and DPP on Databricks 14.3
#10910	from_json, when input = empty object, rapids throws an exception.
#10891	Parsing a column containing invalid json into StructureType with schema throws an Exception.
#11741	[BUG] Fix spark400 build due to writeWithV1 return value change
#11533	Fix JSON Matrix tests on Databricks 14.3
#11722	[BUG] Spark 4.0.0 has moved `NullIntolerant` and builds are breaking because they are unable to find it.
#11726	[BUG] Databricks 14.3 nightly deploy fails due to incorrect DB_SHIM_NAME
#11293	[BUG] A user query with from_json failed with "JSON Parser encountered an invalid format at location"
#9592	[BUG][JSON] `from_json` to Map type should produce null for invalid entries
#11715	[BUG] parquet_testing_test.py failed on "AssertionError: GPU and CPU boolean values are different"
#11716	[BUG] delta_lake_write_test.py failed on "AssertionError: GPU and CPU boolean values are different"
#11684	[BUG] 24.12 Precommit fails with wrong number of arguments in `GpuDataSource`
#11168	[BUG] reserve allocation should be displayed when erroring due to lack of memory on startup
#7585	[BUG] [Regexp] Line anchor '$' incorrect matching of unicode line terminators
#11622	[BUG] GPU Parquet scan filter pushdown fails with timestamp/INT96 column
#11646	[BUG] NullPointerException in GpuRand
#10498	[BUG] Unit tests failed: [INTERVAL_ARITHMETIC_OVERFLOW] integer overflow. Use 'try_add' to tolerate overflow and return NULL instead
#11659	[BUG] parse_url throws exception if partToExtract is invalid while Spark returns null
#10894	Parsing a column containing a nested structure to json throws an exception
#10895	Converting a column containing a map into json throws an exception
#10896	Converting a column containing an array into json throws an exception
#10915	to_json when converts an array will throw an exception:
#10916	to_json function doesn't support map[string, struct] to json conversion.
#10919	to_json converting map[string, integer] to json, throws an exception
#10920	to_json converting an array with maps throws an exception.
#10921	to_json - array with single map
#10923	[BUG] Spark UT framework: to_json function to convert the array with a single empty row to a JSON string throws an exception.
#10924	[BUG] Spark UT framework: to_json when converts an empty array into json throws an exception.
#11024	Fix tests failures in parquet_write_test.py
#11174	Opcode Suite fails for Scala 2.13.8+
#10483	[BUG] JsonToStructs fails to parse all empty dicts and invalid lines
#10489	[BUG] from_json does not support input with \n in it.
#10347	[BUG] Failures in Integration Tests on Dataproc Serverless
#11021	Fix tests failures in orc_cast_test.py
#11609	[BUG] test_hash_repartition_long_overflow_ansi_exception failed on 341DB
#11600	[BUG] regex_test failed mismatched cpu and gpu values in UT and IT
#11611	[BUG] Spark 4.0 build failure - value cannotSaveIntervalIntoExternalStorageError is not a member of object org.apache.spark.sql.errors.QueryCompilationErrors
#10922	from_json cannot support line separator in the input string.
#11009	Fix tests failures in cast_test.py
#11572	[BUG] MultiFileReaderThreadPool may flood the console with log messages

PRs


#11950	Update latest changelog [skip ci]
#11947	Update version to 24.12.1-SNAPSHOT [skip ci]
#11943	Update rapids JNI dependency to 24.12.1
#11944	Update download page for 24.12.1 hot fix release [skip ci]
#11876	Update latest changelog [skip ci]
#11874	Remove 350db143 shim's build [skip ci]
#11851	Update latest changelog [skip ci]
#11849	Update rapids JNI and private dependency to 24.12.0
#11841	[DOC] update doc for 24.12 release [skip ci]
#11857	Increase the pre-merge CI timeout to 6 hours
#11845	Fix leak in isTimeStamp
#11823	Fix for `LEAD/LAG` window function test failures.
#11832	Fix leak in GpuBroadcastNestedLoopJoinExecBase
#11763	Orc writes don't fully support Booleans with nulls
#11794	exclude previous operator's time out of firstBatchHeuristic
#11802	Fall back to CPU for non-UTC months_between
#11792	[BUG] Fix issue 11790
#11768	Fix `dpp_test.py` failures on 14.3
#11752	Ability to decompress snappy and zstd Parquet files via CPU
#11777	Append knoguchi22 to blossom-ci whitelist [skip ci]
#11712	repartition-based fallback for hash aggregate v3
#11771	Fix query hang when using rapids multithread shuffle manager with kudo
#11759	Avoid using StringBuffer in single-threaded methods.
#11766	Fix Kudo batch serializer to only read header in hasNext
#11730	Add support for asynchronous writing for parquet
#11750	Fix aqe_test failures on 14.3.
#11753	Enable JSON Scan and from_json by default
#11733	Print out the current attempt object when OOM inside a retry block
#11618	Execute `from_json` with struct schema using `JSONUtils.fromJSONToStructs`
#11725	host watermark metric
#11746	Remove batch size bytes limits
#11723	Add NVIDIA Copyright
#11721	Add a few more JSON tests for MAP<STRING,STRING>
#11744	Do not package the Databricks 14.3 shim into the dist jar [skip ci]
#11724	Integrate with kudo
#11739	Update to Spark 4.0 changing signature of SupportsV1Write.writeWithV1
#11737	Add in support for months_between
#11700	Fix leak with RapidsHostColumnBuilder in GpuUserDefinedFunction
#11727	Widen type promotion for decimals with larger scale in Parquet Read
#11719	Skip `from_json` overflow tests for 14.3
#11708	Support profiling for specific stages on a limited number of tasks
#11731	Add NullIntolerantShim to adapt to Spark 4.0 removing NullIntolerant
#11413	Support multi string contains
#11728	Change Databricks 14.3 shim name to spark350db143 [skip ci]
#11702	Improve JSON scan and `from_json`
#11635	Added Shims for adding Databricks 14.3 Support
#11714	Let AWS Databricks automatically choose an Availability Zone
#11703	Simplify $ transpiling and fix newline character bug
#11707	impalaFile cannot be found by UT framework.
#11697	Make delta-lake shim dependencies parametrizable
#11710	Add shim version 344 to LogicalPlanShims.scala
#11706	Add retry support in sub hash join
#11673	Fix Parquet Writer tests on 14.3
#11669	Fix `string_test` for 14.3
#11692	Add Spark 3.4.4 Shim
#11695	Fix spark400 build due to LogicalRelation signature changes
#11689	Update the Maven repository to download Spark JAR files [skip ci]
#11670	Fix `misc_expr_test` for 14.3
#11652	Fix skipping fixed_length_char ORC tests on > 13.3
#11644	Skip AQE-join-DPP tests for 14.3
#11667	Preparation for the coming Kudo support
#11685	Exclude shimplify-generated files from scalastyle
#11282	Reserve allocation should be displayed when erroring due to lack of memory on startup
#11671	Use the new host memory allocation API
#11682	Fix auto merge conflict 11679 [skip ci]
#11663	Simplify Transpilation of $ with Extended Line Separator Support in cuDF Regex
#11672	Fix race condition with Parquet filter pushdown modifying shared hadoop Configuration
#11596	Add a new NVTX range for task GPU ownership
#11664	Fix `orc_write_test.py` for 14.3
#11656	[DOC] update the supported OS in download page [skip ci]
#11665	Generate classes identical up to the shim package name
#11647	Fix a NPE issue in GpuRand
#11658	Support format 'yyyyMMdd HH:mm:ss' for legacy mode
#11661	Support invalid partToExtract for parse_url
#11520	UT adjust override checkScanSchemata & enabling ut of exclude_by_suffix fea.
#11634	Put DF_UDF plugin code into the main uber jar.
#11522	UT adjust test SPARK-26677: negated null-safe equality comparison
#11521	Datetime rebasing issue fixed
#11642	Update to_json to be more generic and fix some bugs
#11615	Spark 4 parquet_writer_test.py fixes
#11623	Fix `collection_ops_test` for 14.3
#11553	Fix udf-compiler scala2.13 internal return statements
#11640	Disable date/timestamp types by default when parsing JSON
#11570	Add support for Spark 3.5.3
#11591	Spark UT framework: Read Parquet file generated by parquet-thrift Rapids, UT case adjust.
#11631	Update JSON tests based on a closed/fixed issues
#11617	Quick fix for the build script failure of Scala 2.13 jars [skip ci]
#11614	Ensure repartition overflow test always overflows
#11612	Revert "Disable regex tests to unblock CI (#11606)"
#11597	`install_deps` changes for Databricks 14.3
#11608	Use mvn -f scala2.13/ in the build scripts to build the 2.13 jars
#11610	Change DataSource calendar interval error to fix spark400 build
#11549	Adopt `JSONUtils.concatenateJsonStrings` for concatenating JSON strings
#11595	Remove an unused config shuffle.spillThreads
#11606	Disable regex tests to unblock CI
#11605	Fix auto merge conflict 11604 [skip ci]
#11587	avoid long tail tasks due to PrioritySemaphore, remaining part
#11574	avoid long tail tasks due to PrioritySemaphore
#11559	[Spark 4.0] Address test failures in cast_test.py
#11579	Fix merge conflict with branch-24.10
#11571	Log reconfigure multi-file thread pool only once
#11564	Disk spill metric
#11561	Add in a basic plugin for dataframe UDF support in Apache Spark
#11563	Fix the latest merge conflict in integration tests
#11542	Update rapids JNI and private dependency to 24.12.0-SNAPSHOT [skip ci]
#11493	Support legacy mode for yyyymmdd format

Older Releases

Changelog of older releases can be found at docs/archives
