Merge pull request #21 from microbiomedata/import_workflow_automation
Import workflow automation
mbthornton-lbl authored Nov 29, 2023
2 parents 013fe55 + 539bef2 commit 4aea7b0
Showing 82 changed files with 15,813 additions and 1,478 deletions.
2 changes: 0 additions & 2 deletions .coveragerc
@@ -1,4 +1,2 @@
[run]
branch = True
omit =
src/extract.py
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@ test_data/afile.sha256
htmlcov/
.coverage
attic
.idea/
2 changes: 1 addition & 1 deletion Makefile
@@ -1,3 +1,3 @@

test:
PYTHONPATH=$(shell pwd) pytest --cov-report term --cov-report html --cov=src
PYTHONPATH=$(shell pwd) pytest --cov-report term --cov-report html --cov=nmdc_automation ./tests
31 changes: 29 additions & 2 deletions README.md
@@ -29,8 +29,35 @@ The main scheduling loop does the following:
generate a new job.
4. Generate a job record for each object. Use the workflow spec to populate the inputs.

## Install Dependencies
To install the environment using Poetry, there are a few steps to take.
If Poetry is not installed, run:
`pip install poetry`

Once Poetry is installed, install the project dependencies:
`poetry install`

To use the environment, open a shell inside it:
`poetry shell`
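
Putting the steps together, setup might look like this (a sketch; the Poetry commands above are the authoritative steps):

```bash
# Install Poetry itself if it is not already available
pip install poetry

# Install the project dependencies into a Poetry-managed virtual environment
poetry install

# Open a shell with that environment activated
poetry shell
```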


## Implementation
This package is meant to be used on NMDC-approved compute instances with directories that can be accessed via HTTPS and are linked to the microbiomedata.org/data endpoint.

The main Python drivers can be found in the `nmdc_automation/run_process` directory, which contains two processes that require configurations to be supplied.

#### Run NMDC Workflows with corresponding omics processing records
`nmdc_automation/run_process/run_workflows.py` automates job claims, job processing, and analysis record and data object submission via the NMDC runtime API.
To spawn a daemon that claims, processes, and submits all jobs that have not yet been claimed, `cd` into `nmdc_automation/run_process`
and run `python run_workflows.py watcher --config ../../configs/site_configuration_nersc.toml daemon`. This will watch for omics processing records that have not been claimed and processed.
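
For example (a sketch assuming the NERSC site configuration shipped in `configs/` matches your site):

```bash
cd nmdc_automation/run_process

# Start the watcher daemon: it claims unclaimed jobs, runs the corresponding
# workflows, and submits analysis records and data objects via the runtime API.
python run_workflows.py watcher --config ../../configs/site_configuration_nersc.toml daemon
```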

#### Run Workflow import for data processed by non-NMDC workflows
`nmdc_automation/run_process/run_import.py` is designed to take in data files available on disk, transform them into NMDC analysis records, and submit them back to the central data store via the runtime API. Currently this process is only suitable for data processed at JGI, but with collaboration, data from other processing centers could be transformed and ingested into NMDC.
To submit the import process, `cd` into `nmdc_automation/run_process` and run `python run_import.py project-import import.tsv ../../configs/import.yaml`, where `import.tsv` expects the following format:


| omics_id               | project_id | directory                  |
|------------------------|------------|----------------------------|
| nmdc:omprc-11-q8b9dh63 | Ga0597031  | /path/to/project/Ga0597031 |
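
For example, a minimal run might look like the following sketch; the `import.tsv` row reuses the example values from the table above (whether a header row is also expected is not shown here):

```bash
cd nmdc_automation/run_process

# Build a tab-separated import.tsv with one row per project:
# omics_id, project_id, directory (example values from the table above).
printf 'nmdc:omprc-11-q8b9dh63\tGa0597031\t/path/to/project/Ga0597031\n' > import.tsv

# Transform the on-disk project outputs into NMDC analysis records and
# submit them through the runtime API.
python run_import.py project-import import.tsv ../../configs/import.yaml
```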


* src/job_finder.py has most of the key logic for the runtime scheduling piece
* src/submitter.py has a rough implementation of what would happen on the compute/cromwell side.
