Prepare genome metadata with datasets #326

Merged
19 commits from mbarba/prepare_with_datasets into hackathon/feb24 on Mar 22, 2024

Conversation

@MatBarba (Contributor):

Instead of making a request to the ENA API, get the metadata from the JSON report produced by NCBI's datasets tool.
This simplifies the code (and the tests), since we no longer need to check/mock that part.

Now we need to download the dataset file before running the script, so I've updated the Nextflow module to do so.

I've also modified the prepare step to incorporate the "annotation" part of the genome JSON when annotations are present, and we can use this information to determine how many files we expect at the end of the Nextflow pipeline (3 or 6).
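
For illustration, a minimal Groovy sketch of that check (the function name and the assumption that the prepared genome JSON carries a top-level "annotation" key are illustrative, not the actual ensembl-genomio code):

```groovy
import groovy.json.JsonSlurper

// Sketch only: decide how many pipeline output files to expect for a genome
// based on whether its prepared genome JSON includes an "annotation" section.
// The key name is an assumption made for this example.
def expectedFileCount(genomeJsonPath) {
    def genome = new JsonSlurper().parse(new File(genomeJsonPath.toString()))
    return genome.containsKey('annotation') ? 6 : 3   // 6 with annotation, 3 without
}
```
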
Also note that I've replaced concat with mix: concat waits for the whole list from each channel before passing on the next one, whereas mix pushes items downstream as they arrive.

This change makes groupTuple() work correctly: now, instead of waiting for all the channels to finish, it starts as soon as it has all the files for a genome.
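
A minimal sketch of the resulting channel wiring (channel and variable names are illustrative, and it assumes each tuple also carries the expected file count so groupKey can release a group as soon as it is complete):

```groovy
// Sketch only: fasta_ch, gff_ch and json_ch are assumed to emit
// (accession, expected_count, file) tuples.
fasta_ch
    .mix(gff_ch, json_ch)                                               // items flow through as they arrive
    .map { accession, expected, file -> tuple(groupKey(accession, expected), file) }
    .groupTuple()                                                       // emits each genome once all its files are in
    .set { genome_files_ch }
```
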

@JAlvarezJarreta (Contributor) left a comment:

My main concern about this approach is that it forces datasets to be installed for it to work (or a container with Python, ensembl-genomio and Nextflow to be procured). Since this is out of scope for the time being, what about splitting prepare_genome_metadata.nf in two: one process where datasets is run (so we can use Lahcen's container), and another where genome_metadata_prepare is run, so the local environment is used?

@MatBarba (Author):

> My main concern about this approach is that it forces datasets to be installed for it to work (or a container with Python, ensembl-genomio and Nextflow to be procured). Since this is out of scope for the time being, what about splitting prepare_genome_metadata.nf in two: one process where datasets is run (so we can use Lahcen's container), and another where genome_metadata_prepare is run, so the local environment is used?

Yes, I was considering doing this as well, since it would make caching easier too.

@MatBarba (Author):

I've separated the datasets download from the preparation step (and I've made it so that the datasets JSON files are cached, like the downloaded files).
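
For illustration, a hypothetical sketch of that split (process names, the container image and the genome_metadata_prepare invocation are placeholders, not the actual module code): the download step only needs the datasets CLI and an accession, so it can run in a container, while the preparation step runs in the local environment where ensembl-genomio is installed.

```groovy
// Sketch only: names, container image and script arguments are placeholders.
process DOWNLOAD_GENOME_SUMMARY {
    container 'image-with-ncbi-datasets'   // placeholder image (e.g. Lahcen's container)

    input:
    val accession

    output:
    tuple val(accession), path("${accession}.summary.json")

    script:
    """
    datasets summary genome accession ${accession} > ${accession}.summary.json
    """
}

process PREPARE_GENOME_METADATA {
    // Runs in the local environment where ensembl-genomio is available.

    input:
    tuple val(accession), path(summary_json)

    output:
    tuple val(accession), path("genome.json")

    script:
    """
    # Placeholder invocation: the real genome_metadata_prepare arguments may differ.
    genome_metadata_prepare ${summary_json} > genome.json
    """
}

workflow {
    accessions = Channel.of('GCA_000001215.4')   // example accession
    PREPARE_GENOME_METADATA(DOWNLOAD_GENOME_SUMMARY(accessions))
}
```
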

@MatBarba (Author):

I have fixed a few issues, notably where I was not in fact using the correct annotation flag. I've also added a separate step so that the download of the summary file only requires an accession value.

@JAlvarezJarreta (Contributor) left a comment:

Quite a nice update of the codebase, showing the potential of the datasets CLI tool 🤓

Co-authored-by: J. Alvarez-Jarreta <[email protected]>
@JAlvarezJarreta (Contributor) left a comment:

One tiny docstring update, but this is now ready to go!

@MatBarba merged commit 8f791e7 into hackathon/feb24 on Mar 22, 2024
1 check was pending
@MatBarba deleted the mbarba/prepare_with_datasets branch on March 22, 2024 at 14:46