Implement tool to find all missing Data Objects from legacy projects #50

Closed
mbthornton-lbl opened this issue Feb 1, 2024 · 1 comment · Fixed by #51

mbthornton-lbl commented Feb 1, 2024

Description of the Problem:
When running extract-records against legacy study datasets, we are finding missing DataObjects. A DataObject counts as missing when both of the following hold:

  • DataObject.id is present in one or more has_input/has_output slots
  • The DataObject is present in the data_objects.json file in the project's directory on /global/cfs/cdirs/m3408/results/

But:

  • DataObject.id is not found in the data_objects collection in the DB

For example, from the extract-records log:

WARNING:__main__:DataObjectNotFount nmdc:dc2e21becda8d6b010a95897cf97ae90
WARNING:__main__:workflow_record: nmdc:a5fba8fca3b75c9e0b564a8e311adf46, nmdc:ReadbasedAnalysis, ReadBased Analysis Activity for nmdc:mga0k311
WARNING:__main__:has_input: ['nmdc:39dfcb6a3f2afed8306b3666ec98c75b']
WARNING:__main__:has_output: ['nmdc:dc2e21becda8d6b010a95897cf97ae90', 'nmdc:5f407fa0ab30bf4c41ed96f288084147', 'nmdc:425873a08e598b0ca2987ff7b9b5da1f', 'nmdc:b36e981c6d9031d6495c2691fe3523b1', 'nmdc:e798be205cffa0dc38d93310fdaed9ca', 'nmdc:65ebc0e915b82640592a89c406f0465f', 'nmdc:c8ed546d4bb69c4b6f122d933dcb79af', 'nmdc:f61889342f732f8e12b5f818cfe02e7f', 'nmdc:b2ebb165844db26924a6697e7047988b']
WARNING:__main__:omics_processing_record: nmdc:omprc-11-8nny2x31, Metagenome

Looking in the project directory, at /global/cfs/cdirs/m3408/results/nmdc:mga0k311/ReadbasedAnalysis/data_objects.json, the missing DataObject is the first item in the list:

mbt@perlmutter:login10:/global/cfs/cdirs/m3408/results/nmdc:mga0k311/ReadbasedAnalysis> head data_objects.json 
[
  {
    "description": "Gottcha2 TSV report for gold:Gp0138728", 
    "url": "https://data.microbiomedata.org/data/nmdc:mga0k311/ReadbasedAnalysis/nmdc_mga0k311_gottcha2_report.tsv", 
    "md5_checksum": "dc2e21becda8d6b010a95897cf97ae90", 
    "file_size_bytes": 109, 
    "id": "nmdc:dc2e21becda8d6b010a95897cf97ae90", 
    "name": "gold:Gp0138728_Gottcha2 TSV report"
  }, 
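
The manual check above can be scripted. The following is a minimal sketch, assuming each data_objects.json file is a JSON list of records with an "id" key, as the head output suggests. The function name find_data_object is hypothetical and is not part of extract-records.

```python
import json
from pathlib import Path
from typing import Optional


def find_data_object(json_path: Path, data_object_id: str) -> Optional[dict]:
    """Return the record whose "id" matches data_object_id, or None.

    Assumption: the file holds a JSON list of dicts, each with an "id" key.
    """
    with open(json_path) as fh:
        records = json.load(fh)
    for record in records:
        if record.get("id") == data_object_id:
            return record
    return None
```

For the case above, this would be called with the path to the ReadbasedAnalysis data_objects.json file and the ID nmdc:dc2e21becda8d6b010a95897cf97ae90.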

Proposed Solution:

  • Extract all data_objects.json files that can be found on /global/cfs/cdirs/m3408/results
  • Get all data object IDs from all has_input / has_output slots of the omics and workflow records for a study
  • Note all data object IDs that cannot be found in the database
  • Search for each of those data object IDs in the data object files extracted from the file system
  • Log IDs that are not found
  • Write found data objects to a .json file
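
The steps above could be sketched roughly as follows. This is an illustration only, not the tool that was implemented for this issue. All function names are hypothetical, the database lookup is stood in for by a plain set of known IDs (db_ids), and records are assumed to be dicts with optional has_input / has_output lists. The sketch does not distinguish DataObject IDs from other IDs that may appear in those slots.

```python
import json
import logging
from pathlib import Path
from typing import Dict, Iterable, List, Set, Tuple

logger = logging.getLogger(__name__)


def index_data_object_files(results_root: Path) -> Dict[str, dict]:
    """Map DataObject id -> record for every data_objects.json under results_root."""
    index: Dict[str, dict] = {}
    for path in sorted(results_root.rglob("data_objects.json")):
        with open(path) as fh:
            for record in json.load(fh):  # assumed: a JSON list of dicts
                index.setdefault(record["id"], record)  # keep first occurrence
    return index


def referenced_ids(records: Iterable[dict]) -> Set[str]:
    """Collect every ID listed in has_input / has_output of the given records."""
    ids: Set[str] = set()
    for record in records:
        ids.update(record.get("has_input", []))
        ids.update(record.get("has_output", []))
    return ids


def recover_missing(
    referenced: Set[str], db_ids: Set[str], file_index: Dict[str, dict]
) -> Tuple[List[dict], List[str]]:
    """Split IDs absent from the DB into recovered records and unrecovered IDs."""
    found: List[dict] = []
    not_found: List[str] = []
    for data_object_id in sorted(referenced - db_ids):
        record = file_index.get(data_object_id)
        if record is None:
            logger.warning("DataObject %s not found on file system", data_object_id)
            not_found.append(data_object_id)
        else:
            found.append(record)
    return found, not_found


def write_found(found: List[dict], out_path: Path) -> None:
    """Write the recovered DataObject records to a JSON file."""
    with open(out_path, "w") as fh:
        json.dump(found, fh, indent=2)
```

Building the file index once and looking IDs up in a dict avoids re-reading every data_objects.json file for each missing ID.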

mbthornton-lbl commented

Results and logs from running this tool against all study collections are here:
https://drive.google.com/drive/folders/1HmadG7Ts6qqLBJX0OCzkF2qDCpBVmgs5?usp=drive_link
