- Upgraded to use Spark `3.5.4`.
- Cassandra Docker image tag is now set to `cassandra:5`.
- Bug fix: Any run started with a `previousRunId` that is not found in the `cdm_run_info` table (for whatever reason) will be executed as a fresh new run instead of doing nothing.
- Bug fix: Fixed a connection issue that occurred when using different types of origin and target clusters (e.g. Cassandra/DSE with host/port and Astra with SCB).
- Bug fix: The SCB file on some Spark worker nodes could get deleted before the connection was established, causing a connection exception on that worker node. Added a static async SCB delete delay to address such issues.
- Bug fix: The writetime filter did not work as expected when a custom writetimestamp was also used (issue #327).
- Removed deprecated properties `printStatsAfter` and `printStatsPerPart`. Run metrics should now be tracked using the `trackRun` feature instead.
- Improved metrics output by producing stats labels in an intuitive and consistent order
- Refactored JobCounter by removing any references to `thread` or `global`, as CDM operations are now isolated within partition-ranges (`parts`). Each such `part` is then processed in parallel and aggregated by Spark.
- CDM refactored to be fully Spark Native and more performant when deployed on a multi-node Spark Cluster
- `trackRun` feature has been expanded to record `run-info` for each part in the `CDM_RUN_DETAILS` table. Along with granular metrics, this information can be used to troubleshoot unbalanced or problematic partitions.
- This release has feature parity with the 4.x release and is also backward compatible while adding the above-mentioned improvements. However, we are upgrading it to 5.x as it is a major rewrite of the code to make it Spark native.
- CDM refactored to work when deployed on a Spark Cluster
- More performant for large migration efforts (multi-terabyte clusters with several billion rows) using a Spark Cluster (instead of individual VMs)
- No functional changes and fully backward compatible, just refactor to support Spark cluster deployment
Note: The Spark Cluster-based deployment in this release currently has a bug. It reports '0' for all count metrics even though the underlying tasks (Migration, Validation, etc.) are performed. We are working to address this in upcoming releases. Also note that this issue only affects Spark cluster deployment, not single-VM runs (i.e. no impact to current users).
- Allow using Collections and/or UDTs for `ttl` & `writetime` calculations. This is specifically helpful in scenarios where the only non-key columns are Collections and/or UDTs.
- Made the CDM-generated SCB unique & much shorter-lived when using the TLS option to connect to Astra more securely.
- Upgraded to use log4j 2.x and included a template properties file that helps separate general logs from CDM class-specific logs, including a separate log for rows identified by `DiffData` (Validation) errors.
- Upgraded to use Spark `3.5.3`.
- Added two new codecs, `STRING_BLOB` and `ASCII_BLOB`, to allow migration from `TEXT` and `ASCII` fields to `BLOB` fields. These codecs can also be used to convert `BLOB` to `TEXT` or `ASCII`, but in such cases the `BLOB` value must be TEXT-based in nature & fit within the applicable limits.
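For example, a codec is activated through the properties file; a minimal sketch, assuming the existing `spark.cdm.transform.codecs` property (keyspace/table names are placeholders):

```properties
# Register the STRING_BLOB codec so a TEXT origin column can be written
# to a BLOB target column (keyspace/table names are placeholders).
spark.cdm.schema.origin.keyspaceTable    origin_ks.my_table
spark.cdm.schema.target.keyspaceTable    target_ks.my_table
spark.cdm.transform.codecs               STRING_BLOB
```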
- Added properties `spark.cdm.connect.origin.tls.isAstra` and `spark.cdm.connect.target.tls.isAstra` to allow connecting to Astra DB without using an SCB. This may be needed by enterprises that consider credentials packaged within an SCB a security risk. TLS properties can now be passed as params, or wrapper scripts (not included) could be used to pull sensitive credentials from a vault service in real time & pass them to CDM (see the sketch after this list).
- Switched to using Apache Cassandra® `5.0` docker image for testing
- Introduced smoke testing of the `vector` CQL data type
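A minimal sketch of an SCB-less Astra connection, assuming the new `isAstra` flag on the target side; the host, port, and credential property names follow the existing `spark.cdm.connect.*` convention, and all values are placeholders:

```properties
# Connect to Astra DB without an SCB; a wrapper script could inject the
# token from a vault at runtime (all values are placeholders).
spark.cdm.connect.target.tls.isAstra    true
spark.cdm.connect.target.host           <astra-db-cql-endpoint>
spark.cdm.connect.target.port           <astra-db-cql-port>
spark.cdm.connect.target.username       token
spark.cdm.connect.target.password       <astra-db-application-token>
```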
- Added property `spark.cdm.trackRun.runId` to support a custom unique identifier for the current run. This can be used by wrapper scripts to pass a known `runId` and then use it to query the `cdm_run_info` and `cdm_run_details` tables (see the sketch below).
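A minimal sketch, assuming a wrapper script that supplies its own identifier (the value shown is illustrative):

```properties
# Track this run under a caller-supplied identifier so external tooling
# can later query cdm_run_info / cdm_run_details for it.
spark.cdm.trackRun          true
spark.cdm.trackRun.runId    20250115093000
```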
- Added new `status` value of `DIFF_CORRECTED` on `cdm_run_details` table to specifically mark partitions that were corrected during the CDM validation run.
- Upgraded Validation job to skip partitions with `DIFF_CORRECTED` status on rerun with a previous `runId`.
- Upgraded `spark.cdm.trackRun` feature to include `status` on the `cdm_run_info` table. Also improved the code to handle rerun of a previous run that may have exited before being correctly initialized.
- Added property `spark.cdm.transform.custom.ttl` to allow a custom constant value to be set for TTL instead of using the values from `origin` rows (see the sketch below).
- Repo-wide code formatting & imports organization
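A minimal sketch of the constant-TTL override named above; the value (in seconds) is illustrative:

```properties
# Write all target rows with a fixed 30-day TTL instead of carrying over
# per-row TTL values from origin (value in seconds, illustrative).
spark.cdm.transform.custom.ttl    2592000
```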
- Added `overwrite` option to conditionally check or skip `Validation` when it has a non-null value in `target` for the `spark.cdm.feature.extractJson` feature.
- Added feature `spark.cdm.feature.extractJson` which allows you to extract a JSON value from a column with JSON content in an Origin table and map it to a column in the Target table (see the sketch below).
- Upgraded to use Spark `3.5.2`.
- Use `spark.cdm.schema.origin.keyspaceTable` when `spark.cdm.schema.target.keyspaceTable` is missing. Fixes a bug introduced in the prior version.
- Removed deprecated functionality related to processing token-ranges via partition-file
- Upgraded Spark Cassandra Connector (SCC) version to 3.5.1.
- Minor bug fix (enable tracking when only `previousRunId` is provided, but `trackRun` is not set to `true`)
- Removed deprecated functionality related to retry
- Fixed a validation run bug that sometimes did not report a failed token-range
- Removed deprecated MigrateRowsFromFile job
- Added `spark.cdm.trackRun` feature to support stop and resume functionality for Migration and Validation jobs (see the sketch after this list)
- Validation jobs run with the `auto-correct` feature disabled can now be rerun with the `auto-correct` feature enabled in a more optimal way that corrects only the token-ranges with validation errors during the rerun
- Records summary and details of each run in tables (`cdm_run_info` and `cdm_run_details`) on the `target` keyspace
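A minimal sketch of the stop-and-resume flow, assuming the `previousRunId` property referenced elsewhere in this changelog (identifier value illustrative):

```properties
# First run: enable tracking so run metadata lands in cdm_run_info and
# cdm_run_details on the target keyspace.
spark.cdm.trackRun                  true

# Rerun: resume from a prior run, processing only its unfinished parts
# (identifier value is illustrative).
spark.cdm.trackRun                  true
spark.cdm.trackRun.previousRunId    20250115093000
```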
- Upgraded `constant-column` feature to support `replace` and `remove` of constant columns
- Fixed `constant-column` feature to support any data-types within the PK columns
- Added `Things to know` in docs
- Added property to manage null values in Map fields
- Allow separate input and output partition CSV files
- Updated README
- Internal CI/CD release fix
- Fixed an OOM bug caused when using a partition file with a large value for num-parts
- Upgraded to use Spark `3.5.1`.
- Upgraded to use Spark `3.4.2`.
- Added Java `11` as the minimum required prerequisite to run CDM jobs.
- Code test & coverage changes
- Upgraded to use Scala 2.13
- Allow support for Spark 3.4.1 and SCC 3.4.1, and begin automated testing using the latest Cassandra® 4 series.
- Improved unit test coverage
- Allow support for vector CQL data type
- Allow reserved keywords used as Target column-names
- In rare edge situations, counter tables with existing data in Target can have null values on target. This release handles null values in the Target counter table transparently.
- Counter table columns will usually have zeros to begin with, but in rare edge situations they can have null values. This release handles null values in the counter table transparently.
- Fixed docker build
- Documentation fixes in readme & properties file
- Config namespace fixes
- Refactored exception handling and loading of token-range filters to use the same Migrate & DiffData jobs instead of separate jobs to reduce code & maintenance overhead
- Capture failed partitions in a file for easier reruns
- Optimized mvn to reduce jar size
- Fixed bugs in docs
- Fixes broken maven link in docker build process
- Upgrades to latest stable Maven 3.x
This release is a major code refactor of Cassandra Data Migrator, focused on internal code structure and organization. Automated testing (both unit and integration) was introduced and incorporated into the build process. It includes all features of the previous version, but the properties specified within the configuration (.properties) file have been re-organized and renamed; therefore, the configuration file from the previous version will not work with this version.
New features were also introduced with this release, on top of the 3.4.5 version.
- New features:
  - **Column renaming**: Column names can differ between Origin and Target
  - **Migrate UDTs across keyspaces**: UDTs can be migrated from Origin to Target, even when the keyspace names differ
  - **Data Type Conversion**: Some predefined Codecs support type conversion between Origin and Target; custom Codecs can be added
  - **Separate Writetime and TTL configuration**: Writetime columns can differ from TTL columns
  - **Subset of columns can be specified with Writetime and TTL**: Not all eligible columns need to be used to compute the origin value
  - **Automatic RandomPartitioner min/max**: Partition min/max values no longer need to be manually configured
  - **Populate Target columns with constant values**: New columns can be added to the Target table and populated with constant values
  - **Explode Origin Map Column into Target rows**: A Map in Origin can be expanded into multiple rows in Target when the Map key is part of the Target primary key (see the sketch below)
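As an illustration of the explode-map feature, a minimal properties sketch; the `spark.cdm.feature.explodeMap.*` property names and the column names are assumptions for illustration:

```properties
# Assumed property names, for illustration only: expand an origin map column
# into one target row per map entry, with the map key becoming part of the
# target primary key.
spark.cdm.feature.explodeMap.origin.name          device_readings
spark.cdm.feature.explodeMap.target.name.key      reading_time
spark.cdm.feature.explodeMap.target.name.value    reading_value
```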
Previous releases of the project have not been documented in this file.