-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ecarton/cumulus 3751 s3 task #3910
base: feature/CUMULUS-3751
Are you sure you want to change the base?
Conversation
…ecarton/CUMULUS-3751-as-separate-task
… AL2023 (#3870) * Updated user-data for compatibility with Amazon Linux 2023 AMI * use amazon 2023 ami * install lvm2 * fix task-reaper * add more log for debug * update changelog * test backward compatibility image_id_ecs_amz2 * change back to /ngap/amis/image_id_ecs_al2023_x86 * update ecs user data use IMDSv2 * update fakeprovider IMDSv2 template aws cli v2 * update fake-provider --------- Co-authored-by: mikedorfman <[email protected]>
adding cbanh CI stack
* Update migration to remove vacuum statements that can time out in a Lambda env * Remove additional vacuum statements * Add CL
return false; | ||
} | ||
|
||
const sourceExists = await S3.s3ObjectExists({ Bucket: sourceFile.bucket, Key: sourceFile.key }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: these could probably be run in parallel for efficiency reasons.
} | ||
|
||
/** | ||
* Validates the file matched only one collection and has a valid bucket |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: collection configuration regular expression
zip(sourceGranules, targetGranules), | ||
async ([sourceGranule, targetGranule]) => { | ||
if (!sourceGranule?.files || !targetGranule?.files) { | ||
throw new AssertionError({ message: 'size mismatch between target and source granules' }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this error may not be descriptive of the issue, meaning 'mismatch' doesn't have meaning unless you say target and source granule arrays, and also that it's possible this error occurs not because of a size mismatch, but because a granule record has no files.
} | ||
})); | ||
}, | ||
{ concurrency: getConcurrency() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So - concurrency is per granule, but not per file? Is that acceptable in a case where granules have 1000 files? 10k? etc.
Co-authored-by: Jonathan Kovarik <[email protected]>
Co-authored-by: Jonathan Kovarik <[email protected]>
} | ||
|
||
/** | ||
* Update the cmr objects to contain data adherant to the target granules they reflect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Update the cmr objects to contain data adherant to the target granules they reflect | |
* Update the cmr objects to contain data adherent to the target granules they reflect |
} | ||
|
||
/** | ||
* Create new granule object with updated details including updated files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: specify updates
* Create new granule object with updated details including updated files | ||
* all according to the given target collection | ||
*/ | ||
async function updateGranuleMetadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this method need to be async?
const bucketsConfig = new BucketsConfig(config.buckets); | ||
const targetGranules: Array<ValidGranuleRecord> = []; | ||
const granulesAndMetadata = await Promise.all(granules.map( | ||
async (granule) => updateGranuleMetadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does updateGranuleMetadata need to be async?
* Build a set of granules according to new collection | ||
* New granules reference new collectionId as their collectionId | ||
* files for new granules are updated according to new collection url_path | ||
* file names are *not* updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this not update bucket/key for the files?
return targetGranules; | ||
} | ||
|
||
async function changeGranuleCollectionS3(event: ChangeCollectionsS3Event): Promise<Object> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove the explicit return cast here or make it more explicit - this winds up making the return type less precise.
? config.s3MultipartChunksizeMb : Number(process.env.default_s3_multipart_chunksize_mb); | ||
|
||
const chunkSize = s3MultipartChunksizeMb ? s3MultipartChunksizeMb * MB : undefined; | ||
const targetCollection = await getCollection({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just get this as part of the message config meta? One lookup doesn't feel terribly ... onerous, but given we're already doing database lookups in the trigger endpoint it might be good to try to pull it from there for performance reasons?
collectionVersion: config.targetCollection.version, | ||
}); | ||
|
||
log.debug(`change-granule-collection-s3 config: ${JSON.stringify(event.config)}`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generic: we should consider doing a quick audit of the code once this gets near merge and make sure we have reasonable logging coverage to make support easier.
log.debug(`change-granule-collection-s3 config: ${JSON.stringify(event.config)}`); | ||
|
||
const granuleIds = event.input.granuleIds; | ||
const tempGranulesInput = await Promise.all(granuleIds.map((granuleId) => getGranule({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: double-check retry behavior here (it's currently using the api-client defaults) for sanity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also - this is fine for now, we should probably capture this as a good point for a bulk query perf improvement point.
throw new Error(`granule has unparseable file details ${granule}`); | ||
} | ||
}); | ||
granulesInput = tempGranulesInput as ValidGranuleRecord[]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a forced type assertion, consider using something like the following with size checks:
granulesInput = tempGranulesInput as ValidGranuleRecord[]; | |
granulesInput = tempGranulesInput.filter(apiGranuleRecordIsValid); |
} | ||
|
||
async function changeGranuleCollectionS3(event: ChangeCollectionsS3Event): Promise<Object> { | ||
const config = event.config; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: re-org variable declarations into paragraphs, put debug log at end or near the config it's logging, and have paragraph break between declaration block and logic
if (invalidBehavior === 'skip') { | ||
granulesInput = tempGranulesInput.filter((granule) => { | ||
if (!apiGranuleRecordIsValid(granule)) { | ||
log.warn(`granule has unparseable file details ${granule}`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably include logging in the validator to tell which file has the unparsable details. In a granule with 1k files that would be obnoxious 😅
} else { | ||
tempGranulesInput.forEach((granule) => { | ||
if (!apiGranuleRecordIsValid(granule)) { | ||
throw new Error(`granule has unparseable file details ${granule}`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message looks like a copy/not update. Something like 'granule has validation error' or the like?
prefix: getRequiredEnvVar('stackName'), | ||
granuleId, | ||
}))); | ||
const invalidBehavior = config.invalidBehavior || 'skip'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: invalidGranuleBehavior?
}))); | ||
const invalidBehavior = config.invalidBehavior || 'skip'; | ||
let granulesInput: Array<ValidGranuleRecord>; | ||
if (invalidBehavior === 'skip') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: block comment here. Also consider abstracting and sub-unit testing this block ( I say that having not looked at the units yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also super nit, but consider refactor toward doing something a bit more functional(tm) like:
main() {
granulesInput = generateValidInput(<>)
granulesCmrObject = granulesToCmrFileObjects(<>)
{ old, new } = updateMetadata(<>)
}
for maintenance/etc reasons.
const cmrFiles: Array<ValidApiFile> = granulesToCmrFileObjects( | ||
granulesInput, | ||
isCMRFile | ||
) as ValidApiFile[]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we making a bold assertion here?
Summary: 3751 just the s3 copy part
Addresses CUMULUS-3751: Move granules across collections
Changes
PR Checklist