Migrate and Validate Tables between Origin and Target Cassandra Clusters.
⚠️ Please note this job has been tested with Spark version 3.3.1
- Get the latest image that includes all dependencies from DockerHub
    - All migration tools (`cassandra-data-migrator` + `dsbulk` + `cqlsh`) would be available in the `/assets/` folder of the container
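For example, the image can be pulled and the bundled tools inspected roughly as follows (the image name shown is an assumption; use the coordinates published on DockerHub):
```
# Assumption: the DockerHub image is published as datastax/cassandra-data-migrator;
# substitute the actual image coordinates from DockerHub.
docker pull datastax/cassandra-data-migrator:latest

# Open a shell in the container, then list the bundled tools under /assets/
docker run --rm -it datastax/cassandra-data-migrator:latest bash
ls /assets/
```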
- Download the latest jar file from the GitHub packages area here
- Install Java 8, as the Spark binaries are compiled with it.
- Install Spark version 3.3.1 on a single VM (no cluster necessary) where you want to run this job. Spark can be installed by running the following:
```
wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar -xvzf spark-3.3.1-bin-hadoop3.tgz
```
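Optionally, the Java and Spark installs can be sanity-checked and `spark-submit` put on the `PATH` with something like the following (paths are illustrative; adjust them to where the archive was extracted):
```
# Verify the JVM is Java 8 (Spark 3.3.1 binaries are compiled against it)
java -version

# Illustrative path: point SPARK_HOME at the extracted distribution
export SPARK_HOME="$(pwd)/spark-3.3.1-bin-hadoop3"
export PATH="$SPARK_HOME/bin:$PATH"

# Should report Spark version 3.3.1
spark-submit --version
```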
- `sparkConf.properties` file needs to be configured as applicable for the environment. A sample Spark conf file configuration can be found here (an illustrative, non-authoritative fragment is also sketched after the notes below).
- Place the conf file where it can be accessed while running the job via spark-submit.
- Run the job using the `spark-submit` command as shown below:
```
./spark-submit --properties-file sparkConf.properties \
--master "local[*]" \
--class datastax.astra.migrate.Migrate cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
```
Note:
- The above command generates a log file `logfile_name.txt` to avoid log output on the console.
- Add the options `--driver-memory 25G --executor-memory 25G` as shown below if the table being migrated is large (over 100GB):
```
./spark-submit --properties-file sparkConf.properties \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class datastax.astra.migrate.Migrate cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
```
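For orientation only, a `sparkConf.properties` fragment might look roughly like the sketch below. The property names here are assumptions chosen to illustrate the shape of the file (origin/target connection details plus the AutoCorrect flags documented later); treat the sample conf file referenced in the first step as the authoritative list of keys.
```
# Illustrative sketch only -- these property names are assumptions;
# consult the sample sparkConf.properties for the real keys and values.
spark.origin.host                     origin-cassandra-contact-point
spark.origin.username                 origin-username
spark.origin.password                 origin-password
spark.origin.keyspaceTable            test_keyspace.source_table

spark.target.host                     target-cassandra-contact-point
spark.target.username                 target-username
spark.target.password                 target-password
spark.target.keyspaceTable            test_keyspace.target_table

# AutoCorrect settings used by the data-validation job (see below)
spark.target.autocorrect.missing      false
spark.target.autocorrect.mismatch     false
```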
- To run the job in Data validation mode, use the class option `--class datastax.astra.migrate.DiffData` as shown below:
```
./spark-submit --properties-file sparkConf.properties \
--master "local[*]" \
--class datastax.astra.migrate.DiffData cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
```
- The validation job will report differences as "ERROR" lines in the log file, as shown below:
```
22/10/27 23:25:29 ERROR DiffJobSession: Missing target row found for key: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% Aliquam faucibus
22/10/27 23:25:29 ERROR DiffJobSession: Inserted missing row in target: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% Aliquam faucibus
22/10/27 23:25:30 ERROR DiffJobSession: Mismatch row found for key: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% augue odio at quam Data: (Index: 8 Origin: Hello 3 Target: Hello 2 )
22/10/27 23:25:30 ERROR DiffJobSession: Updated mismatch row in target: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% augue odio at quam
```
- Please grep for all `ERROR` entries in the output log files to get the list of missing and mismatched records; an example is sketched below.
- Note that it lists differences by primary-key values.
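A minimal way to pull these out of the log (the log file name is whatever was used in the redirect above):
```
# All differences reported by the validation job
grep 'ERROR DiffJobSession' logfile_name.txt

# Only rows missing on Target, and only mismatched rows, respectively
grep 'Missing target row found' logfile_name.txt
grep 'Mismatch row found' logfile_name.txt
```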
- The Validation job can also be run in an AutoCorrect mode. This mode can:
    - Add any missing records from origin to target
    - Update any mismatched records between origin and target (makes target same as origin).
- Enable/disable this feature using one or both of the below settings in the config file:
```
spark.target.autocorrect.missing true|false
spark.target.autocorrect.mismatch true|false
```
Note:
- The validation job will never delete records from target, i.e., it only adds or updates data on target.
- You can also use the tool to migrate specific partition ranges using the class option `--class datastax.astra.migrate.MigratePartitionsFromFile` as shown below:
```
./spark-submit --properties-file sparkConf.properties \
--master "local[*]" \
--class datastax.astra.migrate.MigratePartitionsFromFile cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
```
When running in the above mode, the tool expects a `partitions.csv` file to be present in the current folder in the below format, where each line (`min,max`) represents a partition-range:
```
-507900353496146534,-107285462027022883
-506781526266485690,1506166634797362039
2637884402540451982,4638499294009575633
798869613692279889,8699484505161403540
```
This mode is specifically useful to process a subset of partition-ranges that may have failed during a previous run; a minimal example of preparing such a file is sketched below.
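For instance, a `partitions.csv` covering only the ranges to be retried could be created like this (the ranges shown are just the sample values above; substitute the ranges that failed in the previous run):
```
# Write the partition-ranges to retry, one min,max pair per line,
# into the folder from which spark-submit will be launched
cat > partitions.csv <<'EOF'
-507900353496146534,-107285462027022883
2637884402540451982,4638499294009575633
EOF
```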
- Supports migration/validation of Counter tables
- Preserve writetimes and TTLs
- Supports migration/validation of advanced DataTypes (Sets, Lists, Maps, UDTs)
- Filter records from `Origin` using `writetimes` and/or CQL conditions and/or min/max token-range
- Supports adding `constants` as new columns on `Target`
- Fully containerized (Docker and K8s friendly)
- SSL Support (including custom cipher algorithms)
- Migrate from any Cassandra `Origin` (Apache Cassandra® / DataStax Enterprise™ / DataStax Astra DB™) to any Cassandra `Target` (Apache Cassandra® / DataStax Enterprise™ / DataStax Astra DB™)
- Supports migration/validation from and to Azure Cosmos Cassandra
- Validate migration accuracy and performance using a smaller randomized data-set
- Supports adding a custom fixed `writetime`
- Clone this repo
- Move to the repo folder: `cd cassandra-data-migrator`
- Run the build: `mvn clean package` (needs Maven 3.8.x)
- The fat jar (`cassandra-data-migrator-3.x.x.jar`) file should now be present in the `target` folder
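Put together, a local build might look like the following (the repository URL is an assumption here; Maven 3.8.x and Java 8 are expected on the PATH):
```
# Assumption: the repository is hosted at github.com/datastax/cassandra-data-migrator
git clone https://github.com/datastax/cassandra-data-migrator.git
cd cassandra-data-migrator

# Build the fat jar (needs Maven 3.8.x)
mvn clean package

# The versioned fat jar should now be in the target folder
ls target/cassandra-data-migrator-*.jar
```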
Check out all our wonderful contributors here.