Ecarton/cumulus 3751 s3 task #3910

Open · wants to merge 348 commits into base: feature/CUMULUS-3751

Conversation

@etcart etcart (Contributor) commented Jan 28, 2025

Summary: CUMULUS-3751, just the S3 copy part
Addresses CUMULUS-3751: Move granules across collections

Changes

  • Task for copying granules' S3 files from one collection to another
  • Workflow and integration test for this S3 file copy, intended to be expanded to cover the rest of the workflow as the remaining pieces are ready

PR Checklist

  • Update CHANGELOG
  • Unit tests
  • Ad-hoc testing - Deploy changes and test manually
  • Integration tests

etcart and others added 30 commits December 6, 2024 17:32
… AL2023 (#3870)

* Updated user-data for compatibility with Amazon Linux 2023 AMI

* use amazon 2023 ami

* install lvm2

* fix task-reaper

* add more log for debug

* update changelog

* test backward compatibility image_id_ecs_amz2

* change back to /ngap/amis/image_id_ecs_al2023_x86

* update ecs user data use IMDSv2

* update fakeprovider IMDSv2 template aws cli v2

* update fake-provider

---------

Co-authored-by: mikedorfman <[email protected]>
* Update migration to remove vacuum statements that can time out in a Lambda env

* Remove additional vacuum statements

* Add CL
return false;
}

const sourceExists = await S3.s3ObjectExists({ Bucket: sourceFile.bucket, Key: sourceFile.key });
Member

Nit: these could probably be run in parallel for efficiency reasons.
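A minimal sketch of the parallel version, assuming a corresponding target-file existence check sits next to this one (targetFile is an illustrative name, not necessarily the one in the diff):

// Run both existence checks concurrently instead of awaiting them in sequence
const [sourceExists, targetExists] = await Promise.all([
  S3.s3ObjectExists({ Bucket: sourceFile.bucket, Key: sourceFile.key }),
  S3.s3ObjectExists({ Bucket: targetFile.bucket, Key: targetFile.key }),
]);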

}

/**
* Validates the file matched only one collection and has a valid bucket
Member

Nit: collection configuration regular expression

zip(sourceGranules, targetGranules),
async ([sourceGranule, targetGranule]) => {
if (!sourceGranule?.files || !targetGranule?.files) {
throw new AssertionError({ message: 'size mismatch between target and source granules' });
Member

Nit: this error message may not describe the actual issue. 'Mismatch' doesn't mean much unless you say the target and source granule arrays differ in size, and the error can also fire not because of a size mismatch but because a granule record has no files.
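One way to separate the two failure modes, sketched against the quoted diff (granuleId is assumed to be present on the records):

// zip() pads the shorter array with undefined, so a missing pair means a length mismatch
if (!sourceGranule || !targetGranule) {
  throw new AssertionError({ message: 'source and target granule arrays differ in length' });
}
if (!sourceGranule.files) {
  throw new AssertionError({ message: `source granule ${sourceGranule.granuleId} has no files` });
}
if (!targetGranule.files) {
  throw new AssertionError({ message: `target granule ${targetGranule.granuleId} has no files` });
}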

}
}));
},
{ concurrency: getConcurrency() }
Member

So - concurrency is per granule, but not per file? Is that acceptable in a case where granules have 1000 files? 10k? etc.
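If per-file fan-out becomes a concern, one option is to bound both levels, sketched here with p-map (granulePairs, copyGranuleFile, and getFileConcurrency are illustrative names, not existing helpers):

import pMap from 'p-map';

// Bound concurrency across granules and across files within each granule,
// so a granule with thousands of files cannot trigger unbounded S3 copies.
await pMap(
  granulePairs,
  (pair) => pMap(
    pair.files,
    (file) => copyGranuleFile(file),
    { concurrency: getFileConcurrency() }
  ),
  { concurrency: getConcurrency() }
);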

}

/**
* Update the cmr objects to contain data adherant to the target granules they reflect
Member

Suggested change
* Update the cmr objects to contain data adherant to the target granules they reflect
* Update the cmr objects to contain data adherent to the target granules they reflect

}

/**
* Create new granule object with updated details including updated files
Member

Nit: specify updates

* Create new granule object with updated details including updated files
* all according to the given target collection
*/
async function updateGranuleMetadata(
Member

Does this method need to be async?

const bucketsConfig = new BucketsConfig(config.buckets);
const targetGranules: Array<ValidGranuleRecord> = [];
const granulesAndMetadata = await Promise.all(granules.map(
async (granule) => updateGranuleMetadata(
Member

Does updateGranuleMetadata need to be async?

* Build a set of granules according to new collection
* New granules reference new collectionId as their collectionId
* files for new granules are updated according to new collection url_path
* file names are *not* updated
Member

Does this not update bucket/key for the files?

return targetGranules;
}

async function changeGranuleCollectionS3(event: ChangeCollectionsS3Event): Promise<Object> {
@Jkovarik Jkovarik (Member) commented Feb 7, 2025

We should remove the explicit return cast here or make it more explicit - this winds up making the return type less precise.
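A sketch of the more explicit option; the payload fields here are assumptions about what the task returns, not the actual shape:

// Hypothetical output type, instead of the imprecise Promise<Object>
interface ChangeCollectionS3Output {
  granules: ValidGranuleRecord[];
  oldGranules: ValidGranuleRecord[];
}

async function changeGranuleCollectionS3(
  event: ChangeCollectionsS3Event
): Promise<ChangeCollectionS3Output> {
  // ... body unchanged, returning the shape above instead of Object
  return { granules: targetGranules, oldGranules: granulesInput };
}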

? config.s3MultipartChunksizeMb : Number(process.env.default_s3_multipart_chunksize_mb);

const chunkSize = s3MultipartChunksizeMb ? s3MultipartChunksizeMb * MB : undefined;
const targetCollection = await getCollection({
Member

Should we just get this as part of the message config meta? One lookup doesn't feel terribly onerous, but since we're already doing database lookups in the trigger endpoint, it might be good to pull it from there for performance reasons.
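For example (a sketch only; targetCollectionRecord is a hypothetical config field that the trigger endpoint would need to populate in the workflow message):

// Prefer a collection record already carried in the message config,
// falling back to the API lookup only when it is absent.
const targetCollection = config.targetCollectionRecord
  ?? await getCollection({
    prefix: getRequiredEnvVar('stackName'),
    collectionName: config.targetCollection.name,
    collectionVersion: config.targetCollection.version,
  });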

collectionVersion: config.targetCollection.version,
});

log.debug(`change-granule-collection-s3 config: ${JSON.stringify(event.config)}`);
Member

Generic: we should consider doing a quick audit of the code once this gets near merge and make sure we have reasonable logging coverage to make support easier.

log.debug(`change-granule-collection-s3 config: ${JSON.stringify(event.config)}`);

const granuleIds = event.input.granuleIds;
const tempGranulesInput = await Promise.all(granuleIds.map((granuleId) => getGranule({
Member

Note: double-check retry behavior here (it's currently using the api-client defaults) for sanity

Member

Also, this is fine for now, but we should probably capture it as a candidate for a bulk-query performance improvement.

throw new Error(`granule has unparseable file details ${granule}`);
}
});
granulesInput = tempGranulesInput as ValidGranuleRecord[];
@Jkovarik Jkovarik (Member) commented Feb 7, 2025

Instead of a forced type assertion, consider using something like the following with size checks:

Suggested change
granulesInput = tempGranulesInput as ValidGranuleRecord[];
granulesInput = tempGranulesInput.filter(apiGranuleRecordIsValid);
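And the size-check variant the comment alludes to could look roughly like this (sketch only, using the helpers already in the diff):

// Filter with the type guard, then log how many records were dropped
const validGranules = tempGranulesInput.filter(apiGranuleRecordIsValid);
if (validGranules.length !== tempGranulesInput.length) {
  log.warn(`skipped ${tempGranulesInput.length - validGranules.length} granules with unparseable file details`);
}
granulesInput = validGranules;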

}

async function changeGranuleCollectionS3(event: ChangeCollectionsS3Event): Promise<Object> {
const config = event.config;
Member

Nit: re-org variable declarations into paragraphs, put debug log at end or near the config it's logging, and have paragraph break between declaration block and logic

if (invalidBehavior === 'skip') {
granulesInput = tempGranulesInput.filter((granule) => {
if (!apiGranuleRecordIsValid(granule)) {
log.warn(`granule has unparseable file details ${granule}`);
Member

We should probably include logging in the validator to tell which file has the unparseable details. In a granule with 1k files, tracking that down by hand would be obnoxious 😅
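A rough sketch of that per-file logging inside the validator (the type names and file checks here are illustrative, not the actual implementation):

function apiGranuleRecordIsValid(granule: ApiGranuleRecord): granule is ValidGranuleRecord {
  // Log each offending file so a bad record can be found without manual diffing
  const invalidFiles = (granule.files ?? []).filter((file) => !file.bucket || !file.key);
  invalidFiles.forEach((file) => {
    log.warn(`granule ${granule.granuleId} has unparseable file: ${JSON.stringify(file)}`);
  });
  return granule.files !== undefined && invalidFiles.length === 0;
}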

} else {
tempGranulesInput.forEach((granule) => {
if (!apiGranuleRecordIsValid(granule)) {
throw new Error(`granule has unparseable file details ${granule}`);
Member

This error message looks like it was copied but not updated. Something like 'granule has validation error' or the like?

prefix: getRequiredEnvVar('stackName'),
granuleId,
})));
const invalidBehavior = config.invalidBehavior || 'skip';
Member

Nit: invalidGranuleBehavior?

})));
const invalidBehavior = config.invalidBehavior || 'skip';
let granulesInput: Array<ValidGranuleRecord>;
if (invalidBehavior === 'skip') {
Member

Nit: block comment here. Also consider abstracting this block and unit testing it separately (I say that having not looked at the units yet).

Member

Also a super nit, but consider refactoring toward something a bit more functional(tm), like:

main() {
  granulesInput = generateValidInput(<>)
  granulesCmrObject = granulesToCmrFileObjects(<>)
  { old, new } = updateMetadata(<>)
}

for maintenance/etc reasons.

const cmrFiles: Array<ValidApiFile> = granulesToCmrFileObjects(
granulesInput,
isCMRFile
) as ValidApiFile[];
Member

Are we making a bold assertion here?
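If we want to avoid the cast, one hedged alternative is a runtime guard (isValidApiFile is a hypothetical type predicate, not an existing helper):

// Validate the CMR file objects at runtime instead of asserting the type
const cmrFileCandidates = granulesToCmrFileObjects(granulesInput, isCMRFile);
const cmrFiles: Array<ValidApiFile> = cmrFileCandidates.filter(isValidApiFile);
if (cmrFiles.length !== cmrFileCandidates.length) {
  throw new Error('granulesToCmrFileObjects returned files missing bucket or key');
}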
