-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add assign_clade method to CladeTime class #57
Conversation
Since it's possible to mix and match sequence_as_of and tree_as_of dates in cladetime, sequences and reference trees may have different ncov_metadata attributes (dataset version, nexclade cli version, for example) Add an ncov_metadata property to Tree that reflects metadata for the tree_as_of date (as opposed to CladeTime's ncov_metadata property, which reflects sequence_as_of). We'll use this new property to make sure we're using the correct nextclade dataset when assigning clades.
Still in the NCBI mindset, earlier versions of sequence.filter used accession numbers to compare .fasta records to a set of sequence "ids". However, for the processed Nextstrain sequences, we need to use the "strain" column
We will need to instantiate a Tree object from CladeTime when assigning clade sequences. Thus, we shouldn't use CladeTime objects to do this because circulate dependencies
Adding these parameters allows additional filtering on sequence metadata for min and max collection dates. This is in support of clade assignemnts, where we'll only want to assign clades to a small subset of sequences based on their collection date. Behavior is unchanged if these new parameters aren't specified.
This will allow re-use of that function when working with collection begin/end dates in sequence assignment Additional test cases for date commit
This new method is how clade time users (including people using the upcoming CLI) will do custom clade assignments. After validating dates, assign_clades calls out to existing functions, performing a kind of "mini pipeline" to return a LazyFrame with the results from Nextclade merged with metdata from the sequences being assigned.
This changeset represents new tests for the assign_clades method, as well as updates that reflect some refactoring that occurred along the way.
This changeset returns a summarized version of the clade assignments as well as some metadata about the clade assignment process.
1fc1354
to
c9c09d1
Compare
src/cladetime/tree.py
Outdated
self._nextclade_data_url = self._clade_time._config.nextclade_data_url | ||
self._nextclade_data_url_version = self._clade_time._config.nextclade_data_url_version | ||
self._tree_name = self._clade_time._config.nextclade_input_tree_name | ||
self._config = self._clade_time._config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that future commits remove CladeTime
from Tree
instantiation to avoid circular dependencies, so all of the awkward references to _clade_time_config
and the like will disappear
@@ -125,14 +148,6 @@ def _get_tree_url(self): | |||
) | |||
return tree_url | |||
|
|||
def _get_url_ncov_metadata(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed because we're setting the Tree
ncov metadata information the same way we set the CladeTime
ncov metadata property
@@ -71,7 +71,7 @@ def _get_s3_object_url(bucket_name: str, object_key: str, date: datetime) -> Tup | |||
|
|||
|
|||
def _run_nextclade_cli( | |||
nextclade_cli_version: str, nextclade_command: list[str], output_file: Path, input_files: list[Path] | None = None | |||
nextclade_cli_version: str, nextclade_command: list[str], output_path: Path, input_files: list[Path] | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chipping away at #52. The docker run
command only needs to know about the output_path (for volume mounting). The actual file nam efor Nextclade CLI output is specified in nextclade_command
src/cladetime/cladetime.py
Outdated
assigned_clades = sequence_metadata.join( | ||
assigned_clades.lazy(), left_on="strain", right_on="seqName", how="left" | ||
) | ||
return assigned_clades |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future commits change the return value to an object that includes summarized clade counts as well as the line file
@@ -258,7 +260,12 @@ def filter_metadata( | |||
|
|||
|
|||
def get_clade_counts(filtered_metadata: pl.LazyFrame) -> pl.LazyFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left this here because the variant-nowcast-hub scripts still reference it. It's replacement (summarize clades
) does the same thing but:
- has a better name
- allows a configurable list of group_by columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved per pair review
Closes #53
Background
This PR adds an
assign_clade
method to the CladeTime class. The new method uses CladeTime'ssequence_as_of
andtree_as_of
dates to invoke Nextclade's assignment process using the correct tree/dataset.The PR is large, mostly to expedite this feature, which is needed for the variant nowcast hub. That said, it can be reviewed commit-by-commit, which should help.
assign_clade
returns an object that contains three attributes:meta
: information about the clade assignment processdetail
: a Polars LazyFrame that contains line-by-line clade assignmentssummary
: a Polars LazyFrame that summarizes clade counts by location, clade, and dateOnce this PR is merged,
cladetime
will have enough functionality to generate target metadata for the variant nowcast hub.Testing
This PR prioritizes clade assignments that will be run on a scheduled basis (i.e., with the assumption that we'll have sufficient compute and memory resources). Even on a powerful laptop, it's not a good idea to assign more than a ~month's worth of sequence collection dates, as the clade_assignment process is intensive.
Regardless of the collection dates used, assign_clades still needs to download Nextstrain's entire sequence .fasta file before filtering it, so the code below will be slow depending on how good your connection is. Tackling #55 should speed up the download somewhat.
To run the end-to-end clade assignment using an older reference tree against current sequences (this is a nonsensical example):
Known usability improvements (for addressing later)