Add assign_clade method to CladeTime class #57

Merged
@bsweger merged 13 commits into main from bsweger/assign-clades-method/53 on Nov 13, 2024

Conversation

@bsweger (Collaborator) commented Nov 12, 2024

Closes #53

Background

This PR adds an assign_clades method to the CladeTime class. The new method uses CladeTime's sequence_as_of and tree_as_of dates to invoke Nextclade's assignment process with the correct tree/dataset.

The PR is large, mostly to expedite this feature, which is needed for the variant nowcast hub. That said, it can be reviewed commit-by-commit, which should help.

assign_clades returns an object with three attributes:

  1. meta: information about the clade assignment process
  2. detail: a Polars LazyFrame that contains line-by-line clade assignments
  3. summary: a Polars LazyFrame that summarizes clade counts by location, clade, and date

Once this PR is merged, cladetime will have enough functionality to generate target metadata for the variant nowcast hub.

Testing

This PR prioritizes clade assignments that run on a schedule (i.e., it assumes we'll have sufficient compute and memory resources). Even on a powerful laptop, it's not a good idea to assign clades to more than about a month's worth of sequence collection dates, because the clade assignment process is resource-intensive.

Regardless of the collection dates used, assign_clades still needs to download Nextstrain's entire sequence .fasta file before filtering it, so the code below can be slow, depending on your connection speed. Tackling #55 should speed up the download somewhat.

To run the end-to-end clade assignment using an older reference tree against current sequences (this is a nonsensical example):

from cladetime import CladeTime, sequence

ct = CladeTime(tree_as_of="2024-08-02")
filtered_metadata = sequence.filter_metadata(
    ct.sequence_metadata,
    collection_min_date="2024-07-01", 
    collection_max_date="2024-07-01"
)

assignments = ct.assign_clades(filtered_metadata)
assignments.meta
# {'sequence_as_of': datetime.datetime(2024, 11, 12, 18, 9, 19, tzinfo=datetime.timezone.utc),
#  'tree_as_of': datetime.datetime(2024, 8, 2, 0, 0, tzinfo=datetime.timezone.utc),
#  'nextclade_dataset_version': '2024-07-17--12-57-03Z',
#  'nextclade_dataset_name': 'SARS-CoV-2',
#  'nextclade_version_num': '3.8.2',
#  'assignment_as_of': '2024-11-12 18:19'}

assignments.summary.collect()
# shape: (94, 6)
# ┌──────────┬────────────┬──────────────┬──────────────────┬─────────┬───────┐
# │ location ┆ date       ┆ host         ┆ clade_nextstrain ┆ country ┆ count │
# │ ---      ┆ ---        ┆ ---          ┆ ---              ┆ ---     ┆ ---   │
# │ str      ┆ date       ┆ str          ┆ str              ┆ str     ┆ u32   │
# ╞══════════╪════════════╪══════════════╪══════════════════╪═════════╪═══════╡
# │ NM       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24C              ┆ USA     ┆ 4     │
# │ DE       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24C              ┆ USA     ┆ 4     │
# │ VA       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24B              ┆ USA     ┆ 2     │
# │ SD       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24B              ┆ USA     ┆ 2     │
# │ MO       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24B              ┆ USA     ┆ 1     │
# │ …        ┆ …          ┆ …            ┆ …                ┆ …       ┆ …     │
# │ MN       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24C              ┆ USA     ┆ 6     │
# │ MD       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24B              ┆ USA     ┆ 1     │
# │ MD       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24A              ┆ USA     ┆ 2     │
# │ MI       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24A              ┆ USA     ┆ 3     │
# │ MO       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24C              ┆ USA     ┆ 1     │
# └──────────┴────────────┴──────────────┴──────────────────┴─────────┴───────┘

assignments.detail.collect().count()
# shape: (1, 93)
# ┌─────────┬──────┬────────┬──────┬───┬──────────────────┬─────────────┬──────────┬────────┐
# │ country ┆ date ┆ strain ┆ host ┆ … ┆ pcrPrimerChanges ┆ failedCdses ┆ warnings ┆ errors │
# │ ---     ┆ ---  ┆ ---    ┆ ---  ┆   ┆ ---              ┆ ---         ┆ ---      ┆ ---    │
# │ u32     ┆ u32  ┆ u32    ┆ u32  ┆   ┆ u32              ┆ u32         ┆ u32      ┆ u32    │
# ╞═════════╪══════╪════════╪══════╪═══╪══════════════════╪═════════════╪══════════╪════════╡
# │ 518     ┆ 518  ┆ 518    ┆ 518  ┆ … ┆ 0                ┆ 0           ┆ 0        ┆ 0      │
# └─────────┴──────┴────────┴──────┴───┴──────────────────┴─────────────┴──────────┴────────┘

assignments.detail.select(["strain", "clade_nextstrain", "date", "location"]).collect()
# shape: (518, 4)
# ┌───────────────────────────────┬──────────────────┬────────────┬──────────┐
# │ strain                        ┆ clade_nextstrain ┆ date       ┆ location │
# │ ---                           ┆ ---              ┆ ---        ┆ ---      │
# │ str                           ┆ str              ┆ date       ┆ str      │
# ╞═══════════════════════════════╪══════════════════╪════════════╪══════════╡
# │ USA/2024CV1154/2024           ┆ 24A              ┆ 2024-07-01 ┆ AZ       │
# │ USA/2024CV1156/2024           ┆ 24C              ┆ 2024-07-01 ┆ AZ       │
# │ USA/2024CV1161/2024           ┆ 24B              ┆ 2024-07-01 ┆ AZ       │
# │ USA/2024CV1162/2024           ┆ 24C              ┆ 2024-07-01 ┆ AZ       │
# │ USA/2024CV1164/2024           ┆ 24C              ┆ 2024-07-01 ┆ AZ       │
# │ …                             ┆ …                ┆ …          ┆ …        │
# │ USA/WV-CDC-LC1108206/2024     ┆ 24A              ┆ 2024-07-01 ┆ WV       │
# │ humans/USA/WA-PHL-035342/2024 ┆ 24B              ┆ 2024-07-01 ┆ WA       │
# │ humans/USA/WA-PHL-035344/2024 ┆ 24B              ┆ 2024-07-01 ┆ WA       │
# │ humans/USA/WA-PHL-035345/2024 ┆ 24B              ┆ 2024-07-01 ┆ WA       │
# │ humans/USA/WA-PHL-035348/2024 ┆ 24C              ┆ 2024-07-01 ┆ WA       │
# └───────────────────────────────┴──────────────────┴────────────┴──────────┘

Known usability improvements (for addressing later)

Since it's possible to mix and match sequence_as_of and
tree_as_of dates in cladetime, sequences and reference
trees may have different ncov_metadata attributes
(dataset version and Nextclade CLI version, for example).
Add an ncov_metadata property to Tree that reflects
metadata for the tree_as_of date (as opposed to
CladeTime's ncov_metadata property, which reflects
sequence_as_of).

We'll use this new property to make sure we're using
the correct Nextclade dataset when assigning
clades.

Still in the NCBI mindset, earlier versions of sequence.filter
used accession numbers to compare .fasta records to a set
of sequence "ids". However, for the processed Nextstrain
sequences, we need to use the "strain" column.

We will need to instantiate a Tree object from CladeTime
when assigning clades to sequences. Thus, Tree shouldn't
require a CladeTime object at instantiation, because that
would create a circular dependency.

Adding these parameters (collection_min_date and
collection_max_date) allows additional filtering on
sequence metadata for min and max collection dates. This
is in support of clade assignments, where we'll only
want to assign clades to a small subset of sequences based
on their collection date. Behavior is unchanged if these
new parameters aren't specified.

This will allow re-use of that function when working with
collection begin/end dates in sequence assignment.

Additional test cases for date commit

This new method is how cladetime users (including people
using the upcoming CLI) will do custom clade assignments.
After validating dates, assign_clades calls out to existing
functions, performing a kind of "mini pipeline" to return
a LazyFrame with the results from Nextclade merged with
metadata from the sequences being assigned.

This changeset represents new tests for the assign_clades
method, as well as updates that reflect some refactoring
that occurred along the way.

This changeset returns a summarized version of the clade
assignments as well as some metadata about the clade
assignment process.
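
The optional collection-date filtering described in the commit notes above can be sketched in plain Python. This is only an illustration: the real filter_metadata operates on a Polars LazyFrame, and the helper name filter_by_collection_date, the row layout, and the sample strain names below are assumptions, not cladetime's API. The bounds are treated as inclusive, which matches the PR example where the min and max dates are the same day.

```python
from datetime import date


def filter_by_collection_date(rows, collection_min_date=None, collection_max_date=None):
    """Keep rows whose "date" falls within the optional, inclusive bounds.

    Hypothetical stand-in for the date filtering in sequence.filter_metadata;
    when neither bound is given, every row is returned unchanged.
    """
    kept = []
    for row in rows:
        if collection_min_date is not None and row["date"] < collection_min_date:
            continue
        if collection_max_date is not None and row["date"] > collection_max_date:
            continue
        kept.append(row)
    return kept


# Made-up sample rows, one per day around the target date
rows = [
    {"strain": "A", "date": date(2024, 6, 30)},
    {"strain": "B", "date": date(2024, 7, 1)},
    {"strain": "C", "date": date(2024, 7, 2)},
]
one_day = filter_by_collection_date(rows, date(2024, 7, 1), date(2024, 7, 1))
```

Leaving both bounds as None returns the input rows untouched, mirroring the "behavior is unchanged" note.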
@bsweger force-pushed the bsweger/assign-clades-method/53 branch from 1fc1354 to c9c09d1 on November 12, 2024 19:15
self._nextclade_data_url = self._clade_time._config.nextclade_data_url
self._nextclade_data_url_version = self._clade_time._config.nextclade_data_url_version
self._tree_name = self._clade_time._config.nextclade_input_tree_name
self._config = self._clade_time._config
Collaborator Author

Note that future commits remove CladeTime from Tree instantiation to avoid circular dependencies, so all of the awkward references to _clade_time._config and the like will disappear.
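
One way to picture that decoupling: the tree object receives only the value it needs (here, a date) rather than a reference to the object that creates it. The class names TreeSketch and CladeTimeSketch, along with their attributes and the make_tree method, are hypothetical and do not reflect cladetime's actual signatures.

```python
from datetime import datetime, timezone


class TreeSketch:
    """Hypothetical tree holder that depends only on a date, not on its creator."""

    def __init__(self, tree_as_of: datetime):
        self.as_of = tree_as_of


class CladeTimeSketch:
    """Hypothetical parent object that builds a TreeSketch without passing itself in."""

    def __init__(self, sequence_as_of: datetime, tree_as_of: datetime):
        self.sequence_as_of = sequence_as_of
        self.tree_as_of = tree_as_of

    def make_tree(self) -> TreeSketch:
        # Only the date crosses the boundary, so TreeSketch never needs to
        # import or hold a reference to CladeTimeSketch.
        return TreeSketch(self.tree_as_of)


ct = CladeTimeSketch(
    sequence_as_of=datetime(2024, 11, 12, tzinfo=timezone.utc),
    tree_as_of=datetime(2024, 8, 2, tzinfo=timezone.utc),
)
tree = ct.make_tree()
```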

@@ -125,14 +148,6 @@ def _get_tree_url(self):
)
return tree_url

def _get_url_ncov_metadata(self):
Collaborator Author

No longer needed, because we're setting the Tree ncov metadata the same way we set the CladeTime ncov_metadata property.

@@ -71,7 +71,7 @@ def _get_s3_object_url(bucket_name: str, object_key: str, date: datetime) -> Tup


def _run_nextclade_cli(
-    nextclade_cli_version: str, nextclade_command: list[str], output_file: Path, input_files: list[Path] | None = None
+    nextclade_cli_version: str, nextclade_command: list[str], output_path: Path, input_files: list[Path] | None = None
Collaborator Author

Chipping away at #52. The docker run command only needs to know about the output_path (for volume mounting). The actual file name for Nextclade CLI output is specified in nextclade_command.
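
A rough sketch of that split, assuming the CLI runs in the nextstrain/nextclade Docker image: only the output directory is mounted, while the output file name lives inside the Nextclade arguments. The helper name build_docker_command, the container mount point, and the sample argument values are assumptions for illustration; this is not the project's _run_nextclade_cli, and the command is assembled but never executed.

```python
from pathlib import Path


def build_docker_command(
    nextclade_cli_version: str,
    nextclade_command: list[str],
    output_path: Path,
    container_dir: str = "/data/output",  # assumed mount point inside the container
) -> list[str]:
    """Assemble (but do not run) a docker command line.

    The host directory `output_path` is mounted at `container_dir`; the name of
    the output file is whatever `nextclade_command` says it is.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{output_path}:{container_dir}",
        f"nextstrain/nextclade:{nextclade_cli_version}",
        *nextclade_command,
    ]


# Incomplete sample arguments: just enough to show where the file name lives
cmd = build_docker_command(
    "3.8.2",
    ["nextclade", "run", "--output-csv", "/data/output/assignments.csv"],
    Path("/tmp/nextclade-out"),
)
```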

assigned_clades = sequence_metadata.join(
    assigned_clades.lazy(), left_on="strain", right_on="seqName", how="left"
)
return assigned_clades
Collaborator Author

Future commits change the return value to an object that includes summarized clade counts as well as the line file
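
For readers less familiar with the join above, here is a plain-Python stand-in (not Polars) for a left join of sequence metadata onto Nextclade output, matching strain to seqName. Every metadata row is kept, and rows without a Nextclade match get None for the clade. The function name is hypothetical, and the second sample row is invented to show the unmatched case.

```python
def left_join_on_strain(sequence_metadata, nextclade_rows):
    """Left-join metadata rows to Nextclade rows where strain == seqName."""
    by_seq_name = {row["seqName"]: row for row in nextclade_rows}
    joined = []
    for meta in sequence_metadata:
        # An empty dict stands in for "no match", so the clade becomes None
        match = by_seq_name.get(meta["strain"], {})
        joined.append({**meta, "clade_nextstrain": match.get("clade_nextstrain")})
    return joined


sequence_metadata = [
    {"strain": "USA/2024CV1154/2024", "location": "AZ"},
    {"strain": "example/unmatched", "location": "AZ"},  # invented row
]
nextclade_rows = [{"seqName": "USA/2024CV1154/2024", "clade_nextstrain": "24A"}]
joined = left_join_on_strain(sequence_metadata, nextclade_rows)
```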

@@ -258,7 +260,12 @@ def filter_metadata(


def get_clade_counts(filtered_metadata: pl.LazyFrame) -> pl.LazyFrame:
Collaborator Author

Left this here because the variant-nowcast-hub scripts still reference it. Its replacement (summarize clades) does the same thing but:

  • has a better name
  • allows a configurable list of group_by columns
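
To illustrate the second bullet, a configurable group-by count can be sketched with the standard library. This is a stand-in for the Polars implementation, not the library's code; the function name summarize_counts, the default columns, and the sample rows are assumptions.

```python
from collections import Counter


def summarize_counts(rows, group_by=("location", "date", "clade_nextstrain")):
    """Count rows per unique combination of the `group_by` columns."""
    counts = Counter(tuple(row[col] for col in group_by) for row in rows)
    # Rebuild one dict per group, adding the count as an extra column
    return [dict(zip(group_by, key), count=n) for key, n in counts.items()]


# Made-up sample rows
rows = [
    {"location": "AZ", "date": "2024-07-01", "clade_nextstrain": "24C"},
    {"location": "AZ", "date": "2024-07-01", "clade_nextstrain": "24C"},
    {"location": "WA", "date": "2024-07-01", "clade_nextstrain": "24B"},
]
by_default = summarize_counts(rows)
by_clade = summarize_counts(rows, group_by=("clade_nextstrain",))
```

Passing a different group_by tuple changes the granularity of the summary without touching the counting logic.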

Member

@matthewcornell left a comment

approved per pair review

@bsweger merged commit 7e17fa2 into main on Nov 13, 2024
2 checks passed
@bsweger deleted the bsweger/assign-clades-method/53 branch on November 13, 2024 13:49
Successfully merging this pull request may close these issues.

Add assign_clades method to CladeTime
2 participants