Implement tool to find all missing Data Objects from legacy projects #50

mbthornton-lbl · 2024-02-01T19:25:37Z

Description of the Problem:
When running extract-records against legacy study datasets, we are finding missing DataObjects as defined by:

- DataObject.id is present in one or more has_input/has_output slots
- The DataObject is present in the data_objects.json file in the project's directory on /global/cfs/cdirs/m3408/results/

But:

- DataObject.id is not found in the data_objects collection in the DB
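
As a rough illustration, the definition above amounts to a set operation over three collections of IDs. The function name and its parameters below are hypothetical and are not taken from the actual tool; the sketch assumes the three ID collections have already been gathered.

```python
def classify_missing(referenced_ids, filesystem_ids, db_ids):
    """Return IDs that are referenced and on disk but absent from the DB.

    referenced_ids: IDs seen in has_input/has_output slots (assumed input)
    filesystem_ids: IDs found in data_objects.json files (assumed input)
    db_ids:         IDs in the data_objects collection (assumed input)
    """
    referenced = set(referenced_ids)
    on_disk = set(filesystem_ids)
    in_db = set(db_ids)
    # Referenced AND on disk, but NOT in the database.
    return sorted((referenced & on_disk) - in_db)
```

Under these assumptions, an ID such as nmdc:dc2e21becda8d6b010a95897cf97ae90 would be returned when it appears in a has_output slot and in a data_objects.json file but not in the database.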

For example, from the extract-records log:

WARNING:__main__:DataObjectNotFount nmdc:dc2e21becda8d6b010a95897cf97ae90
WARNING:__main__:workflow_record: nmdc:a5fba8fca3b75c9e0b564a8e311adf46, nmdc:ReadbasedAnalysis, ReadBased Analysis Activity for nmdc:mga0k311
WARNING:__main__:has_input: ['nmdc:39dfcb6a3f2afed8306b3666ec98c75b']
WARNING:__main__:has_output: ['nmdc:dc2e21becda8d6b010a95897cf97ae90', 'nmdc:5f407fa0ab30bf4c41ed96f288084147', 'nmdc:425873a08e598b0ca2987ff7b9b5da1f', 'nmdc:b36e981c6d9031d6495c2691fe3523b1', 'nmdc:e798be205cffa0dc38d93310fdaed9ca', 'nmdc:65ebc0e915b82640592a89c406f0465f', 'nmdc:c8ed546d4bb69c4b6f122d933dcb79af', 'nmdc:f61889342f732f8e12b5f818cfe02e7f', 'nmdc:b2ebb165844db26924a6697e7047988b']
WARNING:__main__:omics_processing_record: nmdc:omprc-11-8nny2x31, Metagenome

In the project directory, /global/cfs/cdirs/m3408/results/nmdc:mga0k311/ReadbasedAnalysis/data_objects.json, the missing DataObject is the first item in the list:

mbt@perlmutter:login10:/global/cfs/cdirs/m3408/results/nmdc:mga0k311/ReadbasedAnalysis> head data_objects.json 
[
  {
    "description": "Gottcha2 TSV report for gold:Gp0138728", 
    "url": "https://data.microbiomedata.org/data/nmdc:mga0k311/ReadbasedAnalysis/nmdc_mga0k311_gottcha2_report.tsv", 
    "md5_checksum": "dc2e21becda8d6b010a95897cf97ae90", 
    "file_size_bytes": 109, 
    "id": "nmdc:dc2e21becda8d6b010a95897cf97ae90", 
    "name": "gold:Gp0138728_Gottcha2 TSV report"
  },

Proposed Solution:

1. Extract all data_objects.json files that can be found under /global/cfs/cdirs/m3408/results
2. Get all data object IDs from all has_input / has_output slots for the omics and workflow records of a study
3. Note all data object IDs that cannot be found in the database
4. Search for each of those data object IDs in the data object files extracted from the file system
5. Log the IDs that are not found
6. Write the found data objects to a .json file
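
The steps above could be sketched roughly as follows. This is a minimal illustration and not the implementation merged for this issue: every function name is hypothetical, the database lookup is reduced to a plain set of known IDs (`db_ids`), and every ID in a has_input/has_output slot is assumed to be a data object ID, with no filtering of other record types.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_data_object_files(results_root):
    """Index every DataObject in data_objects.json files under results_root by id."""
    index = {}
    for path in Path(results_root).rglob("data_objects.json"):
        try:
            with open(path) as fh:
                records = json.load(fh)
        except (OSError, json.JSONDecodeError) as exc:
            logger.warning("Could not read %s: %s", path, exc)
            continue
        for record in records:
            if isinstance(record, dict) and "id" in record:
                # Keep the first copy seen if an id appears in several files.
                index.setdefault(record["id"], record)
    return index


def referenced_ids(records):
    """Collect every ID listed in has_input / has_output across the records."""
    ids = set()
    for record in records:
        for slot in ("has_input", "has_output"):
            ids.update(record.get(slot, []))
    return ids


def find_missing(records, db_ids, fs_index):
    """Split IDs absent from the DB into those found on disk and those not."""
    not_in_db = referenced_ids(records) - set(db_ids)
    found = [fs_index[i] for i in sorted(not_in_db) if i in fs_index]
    not_found = sorted(i for i in not_in_db if i not in fs_index)
    for data_object_id in not_found:
        logger.warning("DataObject %s not found in DB or on file system", data_object_id)
    return found, not_found


def write_found(found, out_path):
    """Write the recovered data object records to a JSON file."""
    with open(out_path, "w") as fh:
        json.dump(found, fh, indent=2)
```

In this sketch, `records` would be the omics and workflow records for one study, and `db_ids` the set of IDs already present in the data_objects collection; how those are fetched is left out because it depends on the database client in use.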

mbthornton-lbl · 2024-02-02T22:34:35Z

Results and logs from running this tool against all study collections are here:
https://drive.google.com/drive/folders/1HmadG7Ts6qqLBJX0OCzkF2qDCpBVmgs5?usp=drive_link

mbthornton-lbl self-assigned this Feb 1, 2024

mbthornton-lbl added this to 2024 - Sprint 29 - January 29- February 9, 2024 Feb 1, 2024

mbthornton-lbl moved this to In Progress in 2024 - Sprint 29 - January 29- February 9, 2024 Feb 1, 2024

mbthornton-lbl linked a pull request Feb 5, 2024 that will close this issue

50 implement tool to find all missing data objects from legacy projects #51

Merged

mbthornton-lbl moved this from In Progress to In Review in 2024 - Sprint 29 - January 29- February 9, 2024 Feb 5, 2024

mbthornton-lbl closed this as completed in #51 Feb 7, 2024

github-project-automation bot moved this from In Review to Done in 2024 - Sprint 29 - January 29- February 9, 2024 Feb 7, 2024
