License: Apache2

cassandra-data-migrator

Migrate and Validate Tables between Origin and Target Cassandra Clusters.

⚠️ Please note that this job has been tested with Spark version 3.3.1.

Install as a Container

  • Get the latest image that includes all dependencies from DockerHub
    • All migration tools (cassandra-data-migrator + dsbulk + cqlsh) are available in the /assets/ folder of the container

Install as a JAR file

Prerequisite

  • Install Java 8, since the Spark binaries are compiled with it.
  • Install Spark version 3.3.1 on a single VM (no cluster necessary) where you want to run this job. Spark can be installed by running the following:
wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar -xvzf spark-3.3.1-bin-hadoop3.tgz
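The commands in this README call ./spark-submit, which only works from inside the bin folder of the extracted Spark distribution. As an optional alternative, the sketch below puts that folder on the PATH so spark-submit can be called from any directory. It assumes the tarball was extracted in the current directory; adjust the path otherwise.

```shell
# Assumption: spark-3.3.1-bin-hadoop3 was extracted in the current directory.
export SPARK_HOME="$PWD/spark-3.3.1-bin-hadoop3"
# Make spark-submit (located in $SPARK_HOME/bin) callable without the ./ prefix.
export PATH="$SPARK_HOME/bin:$PATH"
# Optional sanity check once Spark is installed:
# spark-submit --version
```

With this in place, replace ./spark-submit with spark-submit in the commands that follow.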

Steps for Data-Migration:

  1. Configure the sparkConf.properties file as applicable for your environment.

    A sample Spark conf file configuration can be found here

  2. Place the conf file where it can be accessed while running the job via spark-submit.
  3. Run the job using the spark-submit command as shown below:
./spark-submit --properties-file sparkConf.properties \
--master "local[*]" \
--class datastax.astra.migrate.Migrate cassandra-data-migrator-3.x.x.jar &> logfile_name.txt

Note:

  • The above command writes its output to the log file logfile_name.txt to avoid log output on the console.
  • Add the options --driver-memory 25G --executor-memory 25G as shown below if the table being migrated is large (over 100 GB)
./spark-submit --properties-file sparkConf.properties \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class datastax.astra.migrate.Migrate cassandra-data-migrator-3.x.x.jar &> logfile_name.txt

Steps for Data-Validation:

  • To run the job in data validation mode, use the class option --class datastax.astra.migrate.DiffData as shown below
./spark-submit --properties-file sparkConf.properties \
--master "local[*]" \
--class datastax.astra.migrate.DiffData cassandra-data-migrator-3.x.x.jar &> logfile_name.txt
  • The validation job reports differences as ERROR entries in the log file, as shown below
22/10/27 23:25:29 ERROR DiffJobSession: Missing target row found for key: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% Aliquam faucibus
22/10/27 23:25:29 ERROR DiffJobSession: Inserted missing row in target: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% Aliquam faucibus
22/10/27 23:25:30 ERROR DiffJobSession: Mismatch row found for key: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% augue odio at quam Data:  (Index: 8 Origin: Hello 3 Target: Hello 2 )
22/10/27 23:25:30 ERROR DiffJobSession: Updated mismatch row in target: Grapes %% 1 %% 2020-05-22 %% 2020-05-23T00:05:09.353Z %% skuid %% augue odio at quam
  • Grep for ERROR in the output log files to get the list of missing and mismatched records.
    • Note that it lists differences by primary-key values.
  • The validation job can also be run in an AutoCorrect mode. This mode can:
    • Add any missing records from origin to target
    • Update any mismatched records between origin and target (makes target the same as origin).
  • Enable or disable this feature using one or both of the settings below in the config file
spark.target.autocorrect.missing                    true|false
spark.target.autocorrect.mismatch                   true|false

Note:

  • The validation job never deletes records from the target; it only adds or updates data on the target.
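The grep step described in this section can be wrapped in a small helper. The sketch below counts missing and mismatched rows by matching the message text shown in the sample log lines above; it assumes those messages appear verbatim in your log. The function name summarize_diff_log is hypothetical and not part of the tool.

```shell
# Hypothetical helper: count validation findings in a DiffData log file.
# Usage: summarize_diff_log [logfile]   (defaults to logfile_name.txt)
summarize_diff_log() {
  local log="${1:-logfile_name.txt}"
  local missing mismatch
  # grep -c prints the number of matching lines. It exits non-zero when
  # the count is 0, so the exit status is deliberately ignored.
  missing=$(grep -c 'ERROR DiffJobSession: Missing target row found for key' "$log" || true)
  mismatch=$(grep -c 'ERROR DiffJobSession: Mismatch row found for key' "$log" || true)
  echo "missing=$missing mismatch=$mismatch"
}
```

To list the affected primary-key values rather than count them, run grep with the same patterns without the -c option.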

Migrating specific partition ranges

  • You can also use the tool to migrate specific partition ranges using the class option --class datastax.astra.migrate.MigratePartitionsFromFile as shown below
./spark-submit --properties-file sparkConf.properties \
--master "local[*]" \
--class datastax.astra.migrate.MigratePartitionsFromFile cassandra-data-migrator-3.x.x.jar &> logfile_name.txt

When running in this mode, the tool expects a partitions.csv file to be present in the current folder in the format below, where each line (min,max) represents a partition range:

-507900353496146534,-107285462027022883
-506781526266485690,1506166634797362039
2637884402540451982,4638499294009575633
798869613692279889,8699484505161403540

This mode is especially useful for processing a subset of partition ranges that may have failed during a previous run.
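Because a malformed partitions.csv is easy to produce by hand, a quick pre-flight check may help. The sketch below verifies that every non-blank line has exactly two integers in min,max form. It also flags lines where min is greater than max, which is an assumption about what a valid range looks like and not a documented rule of the tool. The function name validate_partitions is hypothetical. It requires bash, and values outside the signed 64-bit range are not detected.

```shell
# Hypothetical helper: sanity-check a partitions.csv file before a run.
# Usage: validate_partitions [file]   (defaults to partitions.csv)
# Returns 0 if all lines look valid, 1 otherwise (problems go to stderr).
validate_partitions() {
  local file="${1:-partitions.csv}" line_no=0 bad=0 min max
  # The "|| [ -n "$min" ]" part also handles a last line with no trailing newline.
  while IFS=, read -r min max || [ -n "$min" ]; do
    line_no=$((line_no + 1))
    # Skip blank lines.
    [ -z "$min$max" ] && continue
    if ! [[ $min =~ ^-?[0-9]+$ && $max =~ ^-?[0-9]+$ ]]; then
      echo "line $line_no: expected two integers as min,max" >&2
      bad=1
    elif (( min > max )); then
      # Assumption: a range is only meaningful when min <= max.
      echo "line $line_no: min is greater than max" >&2
      bad=1
    fi
  done < "$file"
  return "$bad"
}
```

For example, `validate_partitions partitions.csv && echo "partitions.csv looks OK"` prints the confirmation only when no problems were found.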

Features

Building Jar for local development

  1. Clone this repo
  2. Move to the repo folder cd cassandra-data-migrator
  3. Run the build mvn clean package (Needs Maven 3.8.x)
  4. The fat jar (cassandra-data-migrator-3.x.x.jar) file should now be present in the target folder

Contributors

Check out all our wonderful contributors here.