Feature/ecs paths mapping script #2197

Merged
17 commits merged on Dec 7, 2023


@dk1844 dk1844 commented Oct 20, 2023

This PR adds a script that remaps HDFS paths based on the response of a mapping service. Its primary use is the HDFS-to-ECS migration (the defaults are set for this purpose), but the script is general in nature.

Naively dev-tested against a local MongoDB.

If some parts seem unrelated to this script (e.g. migration_free_only=False), note that the script reuses a lot of code from its sibling, migrate_menas.py, which is why they are there.

Examples

Help with params overview:

> python dataset_paths_to_ecs.py --help                                                                                                                                                        
usage: dataset_paths_to_ecs [-h] [-n] [-v] [-t TARGETDB] [-u MAPPINGSERVICE] [-p MAPPINGPREFIX] [-s SKIP_PREFIX [SKIP_PREFIX ...]] [-f {hdfsPath,hdfsPublishPath,all}] [-d DATASET_NAME [DATASET_NAME ...]] [-o] TARGET

Menas MongoDB path changes to ECS

positional arguments:
  TARGET                connection string for target MongoDB

options:
  -h, --help            show this help message and exit
  -n, --dryrun          if specified, skip the actual changes, just print what would be done. (default: False)
  -v, --verbose         prints extra information while running. (default: False)
  -t TARGETDB, --target-database TARGETDB
                        Name of db on target to be affected. (default: menas)
  -u MAPPINGSERVICE, --mapping-service-url MAPPINGSERVICE
                        Service URL to use for path change mapping. (default: https://set-your-mapping-service-here.execute-api.af-south-1.amazonaws.com/dev/map)
  -p MAPPINGPREFIX, --mapping-prefix MAPPINGPREFIX
                        This prefix will be prepended to mapped path by the Mapping service (default: s3a://)
  -s SKIP_PREFIX [SKIP_PREFIX ...], --skip-prefixes SKIP_PREFIX [SKIP_PREFIX ...]
                        Path with these prefixes will be skipped from mapping (default: ['s3a://', '/tmp'])
  -f {hdfsPath,hdfsPublishPath,all}, --fields-to-map {hdfsPath,hdfsPublishPath,all}
                        Map either item's 'hdfsPath', 'hdfsPublishPath' or 'all' (default: all)
  -d DATASET_NAME [DATASET_NAME ...], --datasets DATASET_NAME [DATASET_NAME ...]
                        list datasets names to change paths in (default: [])
  -o, --only-datasets   if specified, mapping table changes will NOT be done. (default: False)
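
The help output above is consistent with an `argparse` parser that uses `ArgumentDefaultsHelpFormatter`. The following is a minimal sketch of such a parser, reconstructed from the help text only and not copied from the script. The `build_parser` helper and the `dest` names (`targetdb`, `mappingservice`, and so on) are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the help text above (dest names are assumed)."""
    parser = argparse.ArgumentParser(
        prog="dataset_paths_to_ecs",
        description="Menas MongoDB path changes to ECS",
        # Appends "(default: ...)" to every option that has a help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-n", "--dryrun", action="store_true",
                        help="if specified, skip the actual changes, just print what would be done.")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="prints extra information while running.")
    parser.add_argument("-t", "--target-database", dest="targetdb", metavar="TARGETDB",
                        default="menas", help="Name of db on target to be affected.")
    parser.add_argument("-u", "--mapping-service-url", dest="mappingservice",
                        metavar="MAPPINGSERVICE",
                        default="https://set-your-mapping-service-here.execute-api.af-south-1.amazonaws.com/dev/map",
                        help="Service URL to use for path change mapping.")
    parser.add_argument("-p", "--mapping-prefix", dest="mappingprefix", metavar="MAPPINGPREFIX",
                        default="s3a://",
                        help="This prefix will be prepended to mapped path by the Mapping service")
    # nargs="+" accepts one or more space-separated values after the flag.
    parser.add_argument("-s", "--skip-prefixes", dest="skipprefixes", metavar="SKIP_PREFIX",
                        nargs="+", default=["s3a://", "/tmp"],
                        help="Path with these prefixes will be skipped from mapping")
    parser.add_argument("-f", "--fields-to-map", dest="fieldstomap",
                        choices=["hdfsPath", "hdfsPublishPath", "all"], default="all",
                        help="Map either item's 'hdfsPath', 'hdfsPublishPath' or 'all'")
    parser.add_argument("-d", "--datasets", dest="datasets", metavar="DATASET_NAME",
                        nargs="+", default=[], help="list datasets names to change paths in")
    parser.add_argument("-o", "--only-datasets", dest="onlydatasets", action="store_true",
                        help="if specified, mapping table changes will NOT be done.")
    parser.add_argument("target", metavar="TARGET",
                        help="connection string for target MongoDB")
    return parser
```

Because `-d` and `-s` take one or more values, the positional `TARGET` is easiest to place first on the command line, as the example runs in this description do.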

Example run for datasets DM9_actn_Cd and DM9_cnsmr_accnt_Sttlmnt

-d - dataset
-t - target db
-u - mapping service URL
-o - only map datasets, not related mapping tables
-f hdfsPublishPath - only hdfsPublishPath field will get path-changed (so hdfsPath will be kept as-is).

> python dataset_paths_to_ecs.py mongodb://localhost:27017/admin -d DM9_actn_Cd DM9_cnsmr_accnt_Sttlmnt -t menas_remap_test -u https://<redacted>.execute-api.af-south-1.amazonaws.com/test/map -f hdfsPublishPath -o
Menas mongo ECS paths mapping
Running with settings: dryrun=False, verbose=False
Using mapping service at: https://<redacted>.execute-api.af-south-1.amazonaws.com/test/map
  target connection-string: mongodb://localhost:27017/admin
  target DB: menas_remap_test
Dataset names to path change (actually found db): ['DM9_actn_Cd', 'DM9_cnsmr_accnt_Sttlmnt']
Configured *NOT* to path-change related mapping tables.

Path changing of collection dataset_v1 started
Found: 3 dataset documents for a potential path change. In progress ...
Successfully migrated 3 of 3 dataset entries, failed: 0

Done.

Example run for dataset XMSK083, which has ties to a mapping table:

-d - dataset
-t - target db
-u - mapping service URL
-n - dryrun (just print)
-v - verbose

> python dataset_paths_to_ecs.py mongodb://localhost:27017/admin -d XMSK083 -t menas_remap_test -v -u https://<redacted>.execute-api.af-south-1.amazonaws.com/test/map -n    
Menas mongo ECS paths mapping
Running with settings: dryrun=True, verbose=True
Using mapping service at: https://<redacted>.execute-api.af-south-1.amazonaws.com/test/map
  target connection-string: mongodb://localhost:27017/admin
  target DB: menas_remap_test
Dataset names given: ['XMSK083']
Dataset names to path change (actually found db): ['XMSK083']
MTs to path change: ['SourceSystemMappingTable']

Path changing of collection dataset_v1 started
Found: 1 dataset documents for a potential path change. In progress ...
Changing paths for dataset 'XMSK083' v5 (_id=5bbc544b2cdc7510a4930f1f).
  *would set* hdfsPath: /bigdatahdfs/datalake/raw/cpf/XMSK083/ -> s3a://<redacted>-prod-edla-cpf-za/raw/XMSK083/, hdfsPublishPath: /bigdatahdfs/datalake/publish/cpf/XMSK083/ -> s3a://<redacted>-prod-edla-cpf-za/publish/XMSK083/

Successfully migrated 0 of 1 dataset entries, failed: 0

Path changing of collection mapping_table_v1 started
Found: 2 mapping table documents for a potential path change. In progress ...
Changing paths for mapping table 'SourceSystemMappingTable' v5 (_id=5b6d732ba43a28a6151422aa).
  *would set* hdfsPath: /bigdatahdfs/datalake/common/mdrc/publish/LATEST/SourceSystemMapping -> s3a://<redacted>-prod-edla-mdrc-za/common/publish/LATEST/SourceSystemMapping/

Changing paths for mapping table 'SourceSystemMappingTable' v1 (_id=5abbaa1e8cdba293c9f0b5a3).
  *would set* hdfsPath: /bigdatahdfs/datalake/common/mdrc/publish/LATEST5/SourceSystemMapping -> s3a://<redacted>-prod-edla-mdrc-za/common/publish/LATEST5/SourceSystemMapping/

Successfully migrated 0 of 2 mapping table entries, failed: 0

Done.
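
The interplay of `-f/--fields-to-map` and `-n/--dryrun` seen in the runs above can be sketched as follows. This is an illustrative in-memory model under stated assumptions, not the script's actual code: documents are plain dicts rather than MongoDB documents, the mapping service is replaced by an injected `map_path` callable, and the helper names `fields_for`, `plan_changes` and `apply_changes` are hypothetical.

```python
from typing import Callable, Dict, List

# The two path fields named by the -f/--fields-to-map choices.
PATH_FIELDS = ["hdfsPath", "hdfsPublishPath"]


def fields_for(option: str) -> List[str]:
    """Translate the -f value ('hdfsPath', 'hdfsPublishPath' or 'all') into field names."""
    return list(PATH_FIELDS) if option == "all" else [option]


def plan_changes(doc: dict, fields: List[str],
                 map_path: Callable[[str], str]) -> Dict[str, str]:
    """Return {field: new_path} for the selected fields whose mapped path differs."""
    changes = {}
    for field in fields:
        old = doc.get(field)
        if old is None:
            continue  # e.g. a document without hdfsPublishPath
        new = map_path(old)
        if new != old:
            changes[field] = new
    return changes


def apply_changes(doc: dict, changes: Dict[str, str], dryrun: bool) -> bool:
    """Print the plan in dry-run mode, otherwise update doc in place.

    Returns True only if the document was actually modified.
    """
    if not changes:
        return False
    described = ", ".join(f"{f}: {doc[f]} -> {new}" for f, new in changes.items())
    if dryrun:
        print(f"  *would set* {described}")
        return False
    doc.update(changes)
    return True
```

In this sketch a dry run never counts as a modification, which would be consistent with the "Successfully migrated 0 of 1" line in the dry-run output above. The script's real counting logic is not shown in this description.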

…related mapping tables will not be path-changed.
…ping-service-url`.

`-f/--fields-to-map=[hdfsPath|hdfsPublishPath|all(default)]` param has been added.
- there is no longer a `pathChanged` mark in the DB; instead, the `-s/--skip-prefix` value is checked to determine whether a path has already been changed.
… ECS service mapping check added; exception communicates the HDFS path used

@miroslavpojer miroslavpojer left a comment


  • pull
  • code review
  • manually test

miroslavpojer previously approved these changes Nov 1, 2023

dk1844 commented Nov 8, 2023

Skip prefixes feature has been updated:

  • `--skip-prefix SINGLE_VALUE` -> `--skip-prefixes VALUE1 VALUE2` (defaults: `s3a://` and `/tmp`; the `-s` shortcut remains)
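
A skip check against several prefixes can be written compactly in Python, because `str.startswith` accepts a tuple of prefixes. The sketch below is an assumption about how such a check might look, not the script's code. The `is_skipped` name and `DEFAULT_SKIP_PREFIXES` constant are hypothetical, with the defaults taken from the help text.

```python
from typing import Iterable

# Defaults as listed for -s/--skip-prefixes in the help output.
DEFAULT_SKIP_PREFIXES = ("s3a://", "/tmp")


def is_skipped(path: str, skip_prefixes: Iterable[str] = DEFAULT_SKIP_PREFIXES) -> bool:
    """Return True if the path starts with any skip prefix.

    A path that already starts with 's3a://' is treated as already mapped,
    so no separate 'pathChanged' marker is needed.
    """
    return path.startswith(tuple(skip_prefixes))
```

For example, `is_skipped("/tmp/foo")` is true with the defaults, while `is_skipped("/bigdatahdfs/datalake/raw/cpf/XMSK083/")` is false, so that path would be sent for mapping.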


sonarcloud bot commented Nov 27, 2023

SonarCloud Quality Gate failed.

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 1 (rating E)
  • Code Smells: 3 (rating A)
  • Coverage: no coverage information
  • Duplication: 0.0%



dk1844 commented Dec 7, 2023

Merging - tested internally. The Jenkins build bears no relevance here, because this is a separate Python migration script.

@dk1844 dk1844 merged commit a798f84 into develop Dec 7, 2023
3 of 6 checks passed
@dk1844 dk1844 deleted the feature/ecs-paths-mapping-script branch December 7, 2023 09:22