Prepare genome metadata with datasets #326
Conversation
My main concern about this approach is that it forces `datasets` to be installed for it to work (or requires procuring a container with Python, ensembl-genomio and Nextflow). Since this is out of scope for the time being, what about splitting `prepare_genome_metadata.nf` in two: one module where `datasets` is run (so we can use Lahcen's container), and another where `genome_metadata_prepare` is run, so the local environment is used.
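A rough sketch of what that split could look like (the process names, container image and `genome_metadata_prepare` invocation are assumptions for illustration, not the actual module code):

```nextflow
// Step 1: run NCBI's datasets inside a container, so the CLI does not
// have to be installed in the local environment.
process DATASETS_METADATA {
    container 'ncbi/datasets'   // placeholder for e.g. Lahcen's container

    input:
        val accession
    output:
        tuple val(accession), path("summary.json")
    script:
        """
        datasets summary genome accession ${accession} > summary.json
        """
}

// Step 2: run genome_metadata_prepare with the local environment,
// where Python and ensembl-genomio are already available.
process PREPARE_GENOME_METADATA {
    input:
        tuple val(accession), path(summary)
    output:
        tuple val(accession), path("genome.json")
    script:
        // the arguments here are illustrative, not the script's real CLI
        """
        genome_metadata_prepare ${summary} > genome.json
        """
}
```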
pipelines/nextflow/modules/genome_metadata/prepare_genome_metadata.nf
Yes, I was considering doing this as well, since it'd make caching easier too.
Co-authored-by: J. Alvarez-Jarreta <[email protected]>
I've separated the `datasets` download from the preparation step (and I've made it so that the `datasets` JSON files are cached, like the downloaded files).
I have fixed a few issues, notably where I was not in fact using the correct annotation flag. I've also added a separate step so that the download of the summary file only requires an accession value.
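A minimal sketch of such a cached, accession-only summary step (the process name, `storeDir` location and parameter name are assumptions; the actual module may cache differently):

```nextflow
process DATASETS_SUMMARY {
    // Persist the JSON report outside the work directory so that, like
    // the downloaded files, it is reused instead of fetched again.
    storeDir "${params.cache_dir}/datasets_summary"   // hypothetical param

    input:
        val accession          // nothing else is needed for the summary
    output:
        path "${accession}.json"
    script:
        """
        datasets summary genome accession ${accession} > ${accession}.json
        """
}
```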
Quite a nice update of the codebase, showing the potential of the `datasets` CLI tool 🤓
pipelines/nextflow/modules/genome_metadata/datasets_metadata.nf
Co-authored-by: J. Alvarez-Jarreta <[email protected]>
One tiny docstring update, but this is now ready to go!
Co-authored-by: J. Alvarez-Jarreta <[email protected]>
Instead of using a request to the ENA API, get the metadata from the JSON report from NCBI's `datasets`.
This simplifies the code (and the tests) since we don't need to check/mock that part.
Now we need to download the dataset file before running the script, so I've updated the Nextflow module to do so.
I've also modified the prepare part to incorporate the "annotation" section of the genome JSON if there are annotations; we can use this information to determine how many files we expect at the end of the Nextflow pipeline (3 or 6).
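A self-contained sketch of that file-count logic (the JSON layout, helper name and example accession are assumptions; only the 3-vs-6 rule comes from the comment above):

```nextflow
import groovy.json.JsonSlurper

// Hypothetical helper: expect 6 files when the genome JSON carries an
// "annotation" section, and 3 when it is assembly-only.
def expectedFileCount(String json_text) {
    def genome = new JsonSlurper().parseText(json_text)
    return genome.containsKey('annotation') ? 6 : 3
}

workflow {
    println expectedFileCount('{"assembly": {"accession": "GCA_000001405.29"}}')                     // 3
    println expectedFileCount('{"assembly": {"accession": "GCA_000001405.29"}, "annotation": {}}')   // 6
}
```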
Also note that I've replaced `concat` by `mix`, since `concat` actually waits for the whole list of each channel before passing on the next one, while `mix` pushes items as they arrive. This change makes the `groupTuple()` work correctly: now, instead of waiting for all the channels to finish, it starts as soon as it has all the files for a genome.
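A runnable toy example of the difference (the channel contents are made up, and the sized `groupKey` is one way to let `groupTuple()` emit early, which may differ from the actual module):

```nextflow
workflow {
    ch_fasta = Channel.of(['GCA_1', 'a.fasta'], ['GCA_2', 'b.fasta'])
    ch_gff   = Channel.of(['GCA_1', 'a.gff'],   ['GCA_2', 'b.gff'])

    // concat would emit every item of ch_fasta before any item of ch_gff;
    // mix interleaves items from both channels as soon as they arrive.
    ch_fasta.mix(ch_gff)
        // A sized group key tells groupTuple() how many files each genome
        // has, so a genome is emitted as soon as its files are complete,
        // rather than when all channels have finished.
        .map { acc, f -> tuple(groupKey(acc, 2), f) }
        .groupTuple()
        .view()
}
```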