Prepare genome metadata with datasets #326

Merged
19 commits from mbarba/prepare_with_datasets into hackathon/feb24 on Mar 22, 2024

Conversation

@MatBarba (Contributor):

Instead of making a request to the ENA API, get the metadata from the JSON report produced by NCBI's datasets tool.
This simplifies the code (and the tests), since we no longer need to check/mock that part.

Now we need to download the dataset file before running the script, so I've updated the Nextflow module to do so.

I've also modified the prepare step to incorporate the "annotation" part of the genome JSON when annotations are present, and we can use this information to determine how many files we expect at the end of the Nextflow pipeline (3 or 6).
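
For illustration, a minimal Groovy sketch of that check (the function name and the assumption that the prepared genome JSON carries a top-level "annotation" key are illustrative, not the actual ensembl-genomio code):

```groovy
import groovy.json.JsonSlurper

// Sketch only: decide how many pipeline output files to expect for a genome
// based on whether its prepared genome JSON includes an "annotation" section.
// The key name is an assumption made for this example.
def expectedFileCount(genomeJsonPath) {
    def genome = new JsonSlurper().parse(new File(genomeJsonPath.toString()))
    return genome.containsKey('annotation') ? 6 : 3   // 6 with annotation, 3 without
}
```
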
Also note that I've replaced concat with mix: concat waits for the whole list from each channel before passing on the next one, whereas mix pushes items downstream as they arrive.

This change makes groupTuple() work correctly: now, instead of waiting for all the channels to finish, it starts as soon as it has all the files for a genome.
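
A minimal sketch of the resulting channel wiring (channel and variable names are illustrative, and it assumes each tuple also carries the expected file count so groupKey can release a group as soon as it is complete):

```groovy
// Sketch only: fasta_ch, gff_ch and json_ch are assumed to emit
// (accession, expected_count, file) tuples.
fasta_ch
    .mix(gff_ch, json_ch)                                               // items flow through as they arrive
    .map { accession, expected, file -> tuple(groupKey(accession, expected), file) }
    .groupTuple()                                                       // emits each genome once all its files are in
    .set { genome_files_ch }
```
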

@JAlvarezJarreta (Contributor) left a comment:

My main concern about this approach is that it forces datasets to be installed for it to work (or a container with Python, ensembl-genomio and Nextflow to be procured). Since this is out of scope for the time being, what about splitting prepare_genome_metadata.nf in two: one process where datasets is run (so we can use Lahcen's container), and another where genome_metadata_prepare is run, so the local environment is used?

@MatBarba (Author):

> My main concern about this approach is that it forces datasets to be installed for it to work (or a container with Python, ensembl-genomio and Nextflow to be procured). Since this is out of scope for the time being, what about splitting prepare_genome_metadata.nf in two: one process where datasets is run (so we can use Lahcen's container), and another where genome_metadata_prepare is run, so the local environment is used?

Yes, I was considering doing this as well, since it would make caching easier too.

@MatBarba (Author):

I've separated the datasets download from the preparation step (and I've made it so that the datasets JSON files are cached, like the downloaded files).
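
For illustration, a hypothetical sketch of that split (process names, the container image and the genome_metadata_prepare invocation are placeholders, not the actual module code): the download step only needs the datasets CLI and an accession, so it can run in a container, while the preparation step runs in the local environment where ensembl-genomio is installed.

```groovy
// Sketch only: names, container image and script arguments are placeholders.
process DOWNLOAD_GENOME_SUMMARY {
    container 'image-with-ncbi-datasets'   // placeholder image (e.g. Lahcen's container)

    input:
    val accession

    output:
    tuple val(accession), path("${accession}.summary.json")

    script:
    """
    datasets summary genome accession ${accession} > ${accession}.summary.json
    """
}

process PREPARE_GENOME_METADATA {
    // Runs in the local environment where ensembl-genomio is available.

    input:
    tuple val(accession), path(summary_json)

    output:
    tuple val(accession), path("genome.json")

    script:
    """
    # Placeholder invocation: the real genome_metadata_prepare arguments may differ.
    genome_metadata_prepare ${summary_json} > genome.json
    """
}

workflow {
    accessions = Channel.of('GCA_000001215.4')   // example accession
    PREPARE_GENOME_METADATA(DOWNLOAD_GENOME_SUMMARY(accessions))
}
```
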

@MatBarba (Author):

I have fixed a few issues, notably where I was not in fact using the correct annotation flag. I've also added a separate step so that the download of the summary file only requires an accession value.

@JAlvarezJarreta (Contributor) left a comment:

Quite a nice update of the codebase, showing the potential of the datasets CLI tool 🤓

Co-authored-by: J. Alvarez-Jarreta <[email protected]>
@JAlvarezJarreta (Contributor) left a comment:

One tiny docstring update, but this is now ready to go!

@MatBarba merged commit 8f791e7 into hackathon/feb24 on Mar 22, 2024
1 check was pending
@MatBarba deleted the mbarba/prepare_with_datasets branch on March 22, 2024 at 14:46