Precompute clustering #18

Merged: 22 commits, merged on Jan 10, 2025

Commits
026e765
add clustering data frame to the solution
rcannood Dec 10, 2024
9392ab6
update script
rcannood Dec 10, 2024
013b54c
add comments
rcannood Dec 10, 2024
b075b3c
add clustering key prefix for cluster-based metrics
mumichae Dec 13, 2024
0f99a8d
add resolutions parameters to metrics to make use of precomputed clus…
mumichae Dec 13, 2024
a4404ca
fix clustering key for nmi and ari
mumichae Dec 13, 2024
54c0fd9
set correct version of scib to make using precomputed clusters possible
mumichae Dec 13, 2024
f80d939
add resolutions argument to cluster-based metrics
mumichae Dec 20, 2024
5234d3c
use igraph for clustering on CPU
mumichae Dec 20, 2024
17d436c
use partial reading for clustering
mumichae Dec 20, 2024
e77ad55
rename cluster keys to be consistent with scib metrics
mumichae Dec 20, 2024
810507f
fix import and reading missing slot
mumichae Dec 20, 2024
391e4b2
get clustering from obsm
mumichae Dec 20, 2024
4b95c18
Add config to create test resources script
lazappi Jan 8, 2025
6c56070
Add clustering to benchmark workflow
lazappi Jan 8, 2025
f3dc116
Remove clustering from process dataset workflow
lazappi Jan 8, 2025
b285dbe
Move output processing to subworkflow
lazappi Jan 9, 2025
81f9649
Update API with processing subworkflow
lazappi Jan 9, 2025
3ff4797
Re-enable all methods/metrics
lazappi Jan 9, 2025
5ee87fd
Remove clustering from fil_solution.yaml API file
lazappi Jan 9, 2025
19b6e52
Add processing to test resources script
lazappi Jan 9, 2025
38552af
update readme
rcannood Jan 10, 2025
26 changes: 14 additions & 12 deletions README.md
@@ -66,25 +66,25 @@ flowchart TB
file_solution("<a href='https://github.com/openproblems-bio/task_batch_integration#file-format-solution'>Solution</a>")
comp_control_method[/"<a href='https://github.com/openproblems-bio/task_batch_integration#component-type-control-method'>Control method</a>"/]
comp_method[/"<a href='https://github.com/openproblems-bio/task_batch_integration#component-type-method'>Method</a>"/]
comp_transformer[/"<a href='https://github.com/openproblems-bio/task_batch_integration#component-type-transform'>Transform</a>"/]
comp_process_integration[/"<a href='https://github.com/openproblems-bio/task_batch_integration#component-type-process-integration'>Process integration</a>"/]
comp_metric[/"<a href='https://github.com/openproblems-bio/task_batch_integration#component-type-metric'>Metric</a>"/]
file_integrated("<a href='https://github.com/openproblems-bio/task_batch_integration#file-format-integration'>Integration</a>")
file_integrated_full("<a href='https://github.com/openproblems-bio/task_batch_integration#file-format-transformed-integration'>Transformed integration</a>")
file_integrated_processed("<a href='https://github.com/openproblems-bio/task_batch_integration#file-format-processed-integration-output'>Processed integration output</a>")
file_score("<a href='https://github.com/openproblems-bio/task_batch_integration#file-format-score'>Score</a>")
file_common_dataset---comp_process_dataset
comp_process_dataset-->file_dataset
comp_process_dataset-->file_solution
file_dataset---comp_control_method
file_dataset---comp_method
file_dataset---comp_transformer
file_dataset---comp_process_integration
file_solution---comp_control_method
file_solution---comp_metric
comp_control_method-->file_integrated
comp_method-->file_integrated
comp_transformer-->file_integrated_full
comp_process_integration-->file_integrated_processed
comp_metric-->file_score
file_integrated---comp_transformer
file_integrated_full---comp_metric
file_integrated---comp_process_integration
file_integrated_processed---comp_metric
```

## File format: Common Dataset
@@ -276,18 +276,19 @@ Arguments:

</div>

## Component type: Transform
## Component type: Process integration

Check the output and transform to create additional output types
Process output from an integration method to the format expected by metrics

Arguments:

<div class="small">

| Name | Type | Description |
|:---|:---|:---|
| `--input_dataset` | `file` | Unintegrated AnnData HDF5 file. |
| `--input_integrated` | `file` | An integrated AnnData dataset. |
| `--expected_method_types` | `string` | NA. |
@@ -356,12 +357,12 @@ Data structure:

</div>

## File format: Transformed integration
## File format: Processed integration output

An integrated AnnData dataset with additional outputs.

Example file:
`resources_test/task_batch_integration/cxg_immune_cell_atlas/integrated_full.h5ad`
`resources_test/task_batch_integration/cxg_immune_cell_atlas/integrated_processed.h5ad`

Description:

@@ -379,7 +380,7 @@ Format:
<div class="small">

AnnData object
obsm: 'X_emb'
obsm: 'X_emb', 'clustering'
obsp: 'connectivities', 'distances'
layers: 'corrected_counts'
uns: 'dataset_id', 'normalization_id', 'dataset_organism', 'method_id', 'neighbors'
@@ -393,6 +394,7 @@ Data structure:
| Slot | Type | Description |
|:---|:---|:---|
| `obsm["X_emb"]` | `double` | (*Optional*) Embedding output - 2D coordinate matrix. |
| `obsm["clustering"]` | `integer` | Leiden clustering results at different resolutions. |
| `obsp["connectivities"]` | `double` | Graph output - neighbor connectivities matrix. |
| `obsp["distances"]` | `double` | Graph output - neighbor distances matrix. |
| `layers["corrected_counts"]` | `double` | (*Optional*) Feature output - corrected counts. |
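The `obsm["clustering"]` slot added above is a data frame with one `leiden_<resolution>` column per precomputed resolution. A minimal sketch of how a cluster-based metric could consume it (the paths, the `leiden_2.0` resolution, and the `cell_type` label column are illustrative, not taken from this diff):

```python
# Illustrative only: read a precomputed Leiden clustering at one resolution
# from the processed integration output and score it against the solution.
import anndata as ad
from sklearn.metrics import normalized_mutual_info_score

adata = ad.read_h5ad("integrated_processed.h5ad")  # hypothetical path
solution = ad.read_h5ad("solution.h5ad")           # hypothetical path

clusters = adata.obsm["clustering"]["leiden_2.0"]  # one column per resolution
labels = solution.obs["cell_type"]                 # assumed label column

print(f"NMI: {normalized_mutual_info_score(labels, clusters):.3f}")
```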
29 changes: 19 additions & 10 deletions scripts/create_resources/test_resources.sh
@@ -14,26 +14,35 @@ DATASET_DIR=resources_test/task_batch_integration
mkdir -p $DATASET_DIR

# process dataset
viash run src/data_processors/process_dataset/config.vsh.yaml -- \
nextflow run . \
-main-script target/nextflow/workflows/process_datasets/main.nf \
-profile docker \
--input "$RAW_DATA/cxg_immune_cell_atlas/dataset.h5ad" \
--output_dataset "$DATASET_DIR/cxg_immune_cell_atlas/dataset.h5ad" \
--output_solution "$DATASET_DIR/cxg_immune_cell_atlas/solution.h5ad"
--publish_dir "$DATASET_DIR/cxg_immune_cell_atlas" \
--output_dataset dataset.h5ad \
--output_solution solution.h5ad \
--output_state state.yaml \
-c common/nextflow_helpers/labels_ci.config \

# run one method
viash run src/methods/combat/config.vsh.yaml -- \
--input $DATASET_DIR/cxg_immune_cell_atlas/dataset.h5ad \
--output $DATASET_DIR/cxg_immune_cell_atlas/integrated.h5ad

# run transformer
viash run src/data_processors/transform/config.vsh.yaml -- \
--input_integrated $DATASET_DIR/cxg_immune_cell_atlas/integrated.h5ad \
--input_dataset $DATASET_DIR/cxg_immune_cell_atlas/dataset.h5ad \
--expected_method_types feature \
--output $DATASET_DIR/cxg_immune_cell_atlas/integrated_full.h5ad
# process integration
nextflow run . \
-main-script target/nextflow/data_processors/process_integration/main.nf \
-profile docker \
--input_dataset "$RAW_DATA/cxg_immune_cell_atlas/dataset.h5ad" \
--input_integrated "$DATASET_DIR/cxg_immune_cell_atlas/integrated.h5ad" \
--expected_method_types feature \
--publish_dir "$DATASET_DIR/cxg_immune_cell_atlas" \
--output integrated_processed.h5ad \
-c common/nextflow_helpers/labels_ci.config \

# run one metric
viash run src/metrics/graph_connectivity/config.vsh.yaml -- \
--input_integrated $DATASET_DIR/cxg_immune_cell_atlas/integrated_full.h5ad \
--input_integrated $DATASET_DIR/cxg_immune_cell_atlas/integrated_processed.h5ad \
--input_solution $DATASET_DIR/cxg_immune_cell_atlas/solution.h5ad \
--output $DATASET_DIR/cxg_immune_cell_atlas/score.h5ad

2 changes: 1 addition & 1 deletion src/api/comp_metric.yaml
@@ -8,7 +8,7 @@ info:
A metric for evaluating batch integration methods.
arguments:
- name: --input_integrated
__merge__: file_integrated_full.yaml
__merge__: file_integrated_processed.yaml
direction: input
required: true
- name: --input_solution
45 changes: 45 additions & 0 deletions src/api/comp_process_integration.yaml
@@ -0,0 +1,45 @@
namespace: data_processors
info:
type: process_integration
type_info:
label: Process integration
summary: Process output from an integration method to the format expected by metrics
description: |
This component will:

- Perform transformations of the integration output
- Cluster the integrated data at different resolutions

argument_groups:
- name: Inputs
arguments:
- name: "--input_dataset"
__merge__: /src/api/file_dataset.yaml
type: file
direction: input
required: true
- name: "--input_integrated"
__merge__: /src/api/file_integrated.yaml
type: file
direction: input
required: true
- name: --expected_method_types
type: string
direction: input
required: true
multiple: true
description: |
The expected output types of the batch integration method.
choices: [ feature, embedding, graph ]
- name: Outputs
arguments:
- name: "--output"
__merge__: file_integrated_processed.yaml
direction: output
required: true

test_resources:
- type: python_script
path: /common/component_tests/run_and_check_output.py
- path: /resources_test/task_batch_integration/cxg_immune_cell_atlas
dest: resources_test/task_batch_integration/cxg_immune_cell_atlas
39 changes: 0 additions & 39 deletions src/api/comp_transformer.yaml

This file was deleted.

src/api/file_integrated_full.yaml → src/api/file_integrated_processed.yaml (renamed)
@@ -1,6 +1,6 @@
type: file
example: "resources_test/task_batch_integration/cxg_immune_cell_atlas/integrated_full.h5ad"
label: Transformed integration
example: "resources_test/task_batch_integration/cxg_immune_cell_atlas/integrated_processed.h5ad"
label: Processed integration output
summary: An integrated AnnData dataset with additional outputs.
description: |
Must contain at least one of:
@@ -23,6 +23,10 @@ info:
name: X_emb
description: Embedding output - 2D coordinate matrix
required: false
- type: integer
name: clustering
description: Leiden clustering results at different resolutions.
required: true
obsp:
- type: double
name: connectivities
30 changes: 30 additions & 0 deletions src/data_processors/precompute_clustering_merge/config.vsh.yaml
@@ -0,0 +1,30 @@
name: precompute_clustering_merge
namespace: data_processors
label: Merge clustering precomputations
summary: Merge the precompute results of clustering on the input dataset
arguments:
- name: --input
type: file
direction: input
required: true
- name: --output
type: file
direction: output
required: true
- name: --clusterings
type: file
description: Clustering results to merge
direction: input
required: true
multiple: true
resources:
- type: python_script
path: script.py
engines:
- type: docker
image: openproblems/base_python:1.0.0
runners:
- type: executable
- type: nextflow
directives:
label: [midtime, midmem, lowcpu]
28 changes: 28 additions & 0 deletions src/data_processors/precompute_clustering_merge/script.py
@@ -0,0 +1,28 @@
import anndata as ad
import pandas as pd

## VIASH START
par = {
"input": "resources_test/task_batch_integration/cxg_immune_cell_atlas/dataset.h5ad",
"clusterings": ["output.h5ad", "output2.h5ad"],
"output": "output3.h5ad",
}
## VIASH END

print("Read clusterings", flush=True)
clusterings = []
for clus_file in par["clusterings"]:
adata = ad.read_h5ad(clus_file)
obs_filt = adata.obs.filter(regex='leiden_[0-9.]+')
clusterings.append(obs_filt)

print("Merge clusterings", flush=True)
merged = pd.concat(clusterings, axis=1)

print("Read input", flush=True)
input = ad.read_h5ad(par["input"])

input.obsm["clustering"] = merged

print("Store outputs", flush=True)
input.write_h5ad(par["output"], compression="gzip")
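A note on the merge step above: `pd.concat(clusterings, axis=1)` aligns the per-resolution columns on the shared cell index rather than on row order, so each clustering file is assumed to carry the same `obs` index as the input dataset. A hedged guard along these lines (not part of the PR), placed just before the `obsm` assignment, would make a mismatch fail fast:

```python
# Hypothetical guard (not in the PR): fail fast if any clustering was
# computed on a different set of cells than the input dataset.
assert all(df.index.equals(input.obs.index) for df in clusterings), \
    "clustering obs index does not match the input dataset"
```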
35 changes: 35 additions & 0 deletions src/data_processors/precompute_clustering_run/config.vsh.yaml
@@ -0,0 +1,35 @@
name: precompute_clustering_run
namespace: data_processors
label: Run clustering precomputations
summary: Run clustering on the input dataset
arguments:
- name: --input
__merge__: /src/api/file_common_dataset.yaml
direction: input
required: true
- name: --output
__merge__: /src/api/file_dataset.yaml
direction: output
required: true
- type: double
name: resolution
default: 0.8
description: Resolution parameter for clustering
resources:
- type: python_script
path: script.py
- path: /src/utils/read_anndata_partial.py
engines:
- type: docker
image: openproblems/base_python:1.0.0
setup:
- type: python
pypi:
- scanpy
- igraph
- leidenalg
runners:
- type: executable
- type: nextflow
directives:
label: [midtime, midmem, lowcpu]
50 changes: 50 additions & 0 deletions src/data_processors/precompute_clustering_run/script.py
@@ -0,0 +1,50 @@
import sys
import anndata as ad

# check if we can use GPU
USE_GPU = False
try:
import subprocess
assert subprocess.run('nvidia-smi', shell=True, stdout=subprocess.DEVNULL).returncode == 0
from rapids_singlecell.tl import leiden
USE_GPU = True
except Exception:
    # no GPU available, fall back to scanpy's CPU implementation
    from scanpy.tl import leiden

## VIASH START
par = {
"input": "resources_test/task_batch_integration/cxg_immune_cell_atlas/dataset.h5ad",
"output": "output.h5ad",
"resolution": 0.8,
}
## VIASH END

sys.path.append(meta["resources_dir"])
from read_anndata_partial import read_anndata

n_cell_cpu = 300_000

print("Read input", flush=True)
input = read_anndata(par["input"], obs='obs', obsp='obsp', uns='uns')

key = f'leiden_{par["resolution"]}'
kwargs = dict()
if not USE_GPU:
kwargs |= dict(
flavor='igraph',
n_iterations=2,
)

print(f"Run Leiden clustering with {kwargs}", flush=True)
leiden(
input,
resolution=par["resolution"],
key_added=key,
**kwargs,
)

print("Store outputs", flush=True)
output = ad.AnnData(
obs=input.obs[[key]],
)
output.write_h5ad(par["output"], compression="gzip")
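Taken together, the two components implement a scatter/gather pattern: `precompute_clustering_run` is launched once per resolution and writes only the new `leiden_<resolution>` column, and `precompute_clustering_merge` gathers those columns into `obsm["clustering"]`. A minimal in-process sketch of that flow (the resolution grid and path are illustrative; the benchmark drives each resolution as a separate Nextflow job):

```python
# Illustrative end-to-end sketch of run + merge in a single process,
# assuming the dataset already has a neighbors graph in .obsp/.uns.
import anndata as ad
import pandas as pd
import scanpy as sc

adata = ad.read_h5ad("dataset.h5ad")  # hypothetical path

clusterings = []
for resolution in [0.8, 1.0, 2.0]:  # hypothetical resolution grid
    key = f"leiden_{resolution}"
    sc.tl.leiden(adata, resolution=resolution, key_added=key,
                 flavor="igraph", n_iterations=2)
    clusterings.append(adata.obs[[key]])

# Gather step, mirroring precompute_clustering_merge/script.py:
# one column per resolution, aligned on the obs index.
adata.obsm["clustering"] = pd.concat(clusterings, axis=1)
```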