50 implement tool to find all missing data objects from legacy projects #51

mbthornton-lbl
Contributor

This PR adds a new command-line tool, orphan-data-objects, which does the following:

  • Find all DataObject IDs in the has_input / has_output slots of every workflow for a Study
  • Search for each ID in the data_objects_set collection of the NMDC database
  • Look up each "orphan" DataObject ID (one not in the DB) in an extracted data file compiled from every data_objects.json file that could be found on the filesystem
  • Write JSON for any DataObject(s) that can be recovered from that extracted data file
  • Write a log file detailing what was found / not found
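
The steps above can be sketched roughly as follows. This is a minimal illustration using plain in-memory data, not the tool's actual code: the function names (`collect_data_object_ids`, `find_orphans`, `recover_orphans`) and the record shapes are hypothetical, and the sketch assumes every value in a has_input / has_output slot is a DataObject ID and that each extracted record carries an `id` field.

```python
import json


def collect_data_object_ids(workflow_records):
    """Gather every ID referenced in has_input / has_output slots.

    Assumption: each workflow record is a dict whose slots hold lists of IDs.
    """
    ids = set()
    for record in workflow_records:
        for slot in ("has_input", "has_output"):
            ids.update(record.get(slot, []))
    return ids


def find_orphans(referenced_ids, known_ids):
    """Return the referenced IDs that are absent from the database, sorted."""
    return sorted(set(referenced_ids) - set(known_ids))


def recover_orphans(orphan_ids, extracted_records):
    """Split orphans into recoverable records and IDs that remain missing.

    Assumption: extracted_records is a list of dicts loaded from the
    extracted data file, each with an "id" key.
    """
    by_id = {rec["id"]: rec for rec in extracted_records}
    recovered = [by_id[i] for i in orphan_ids if i in by_id]
    missing = [i for i in orphan_ids if i not in by_id]
    return recovered, missing


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    workflows = [
        {"id": "wf-1", "has_input": ["do-1"], "has_output": ["do-2", "do-3"]},
        {"id": "wf-2", "has_input": ["do-3"], "has_output": ["do-4"]},
    ]
    ids_in_db = {"do-1", "do-3"}  # stand-in for a data_objects_set lookup
    extracted = [{"id": "do-2", "name": "reads.fastq.gz"}]

    orphans = find_orphans(collect_data_object_ids(workflows), ids_in_db)
    recovered, missing = recover_orphans(orphans, extracted)
    print(json.dumps({"recovered": recovered, "missing": missing}, indent=2))
```

Set difference keeps the orphan check to a single pass once the referenced IDs are collected. A real run would replace `ids_in_db` with a query against the database.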

Results:

https://drive.google.com/drive/folders/1HmadG7Ts6qqLBJX0OCzkF2qDCpBVmgs5?usp=drive_link

@Michal-Babins Michal-Babins left a comment

Expanded the database-related logic and added a click argument to handle orphaned data objects.

@mbthornton-lbl mbthornton-lbl merged commit 72f2a81 into main Feb 7, 2024
1 check passed
@mbthornton-lbl mbthornton-lbl deleted the 50-implement-tool-to-find-all-missing-data-objects-from-legacy-projects branch February 7, 2024 17:11
Development

Successfully merging this pull request may close these issues.

Implement tool to find all missing Data Objects from legacy projects
2 participants