Merge pull request #21 from microbiomedata/import_workflow_automation
Import workflow automation
mbthornton-lbl authored Nov 29, 2023
2 parents 013fe55 + 539bef2 commit 4aea7b0
Showing 82 changed files with 15,813 additions and 1,478 deletions.
2 changes: 0 additions & 2 deletions .coveragerc
@@ -1,4 +1,2 @@
[run]
branch = True
omit =
src/extract.py
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@ test_data/afile.sha256
htmlcov/
.coverage
attic
.idea/
2 changes: 1 addition & 1 deletion Makefile
@@ -1,3 +1,3 @@

test:
PYTHONPATH=$(shell pwd) pytest --cov-report term --cov-report html --cov=src
PYTHONPATH=$(shell pwd) pytest --cov-report term --cov-report html --cov=nmdc_automation ./tests
31 changes: 29 additions & 2 deletions README.md
@@ -29,8 +29,35 @@ The main scheduling loop does the following:
generate a new job.
4. Generate a job record for each object. Use the workflow spec to populate the inputs.

## Install Dependencies
To install the environment using Poetry, there are a few steps to take.
If Poetry is not installed, run:
`pip install poetry`

Once Poetry is installed, install the project dependencies:
`poetry install`

To use the environment, open a shell inside it:
`poetry shell`
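
Putting the steps together, setup might look like this (a sketch; the Poetry commands above are the authoritative steps):

```bash
# Install Poetry itself if it is not already available
pip install poetry

# Install the project dependencies into a Poetry-managed virtual environment
poetry install

# Open a shell with that environment activated
poetry shell
```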


## Implementation
This package is meant to be used on NMDC-approved compute instances with directories that can be accessed via HTTPS and are linked to the microbiomedata.org/data endpoint.

The main Python drivers can be found in the `nmdc_automation/run_process` directory, which contains two processes that require configurations to be supplied.

#### Run NMDC Workflows with corresponding omics processing records
`nmdc_automation/run_process/run_workflows.py` automates job claims, job processing, and analysis record and data object submission via the NMDC runtime API.
To spawn a daemon that claims, processes, and submits all jobs that have not yet been claimed, `cd` into `nmdc_automation/run_process`
and run `python run_workflows.py watcher --config ../../configs/site_configuration_nersc.toml daemon`. This will watch for omics processing records that have not been claimed and processed.
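
For example (a sketch assuming the NERSC site configuration shipped in `configs/` matches your site):

```bash
cd nmdc_automation/run_process

# Start the watcher daemon: it claims unclaimed jobs, runs the corresponding
# workflows, and submits analysis records and data objects via the runtime API.
python run_workflows.py watcher --config ../../configs/site_configuration_nersc.toml daemon
```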

#### Run Workflow import for data processed by non-NMDC workflows
`nmdc_automation/run_process/run_import.py` is designed to take in data files available on disk, transform them into NMDC analysis records, and submit them back to the central data store via the runtime API. Currently this process is only suitable for data processed at JGI, but with collaboration, data from other processing centers could be transformed and ingested into NMDC.
To submit the import process, `cd` into `nmdc_automation/run_process` and run `python run_import.py project-import import.tsv ../../configs/import.yaml`, where `import.tsv` expects the following format:


| omics_id               | project_id | directory                  |
|------------------------|------------|----------------------------|
| nmdc:omprc-11-q8b9dh63 | Ga0597031  | /path/to/project/Ga0597031 |
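
For example, a minimal run might look like the following sketch; the `import.tsv` row reuses the example values from the table above (whether a header row is also expected is not shown here):

```bash
cd nmdc_automation/run_process

# Build a tab-separated import.tsv with one row per project:
# omics_id, project_id, directory (example values from the table above).
printf 'nmdc:omprc-11-q8b9dh63\tGa0597031\t/path/to/project/Ga0597031\n' > import.tsv

# Transform the on-disk project outputs into NMDC analysis records and
# submit them through the runtime API.
python run_import.py project-import import.tsv ../../configs/import.yaml
```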


* src/job_finder.py has most of the key logic for the runtime scheduling piece
* src/submitter.py has a rough implementation of what would happen on the compute/cromwell side.
