Add assign_clade method to CladeTime class #57

bsweger · 2024-11-12T19:02:23Z

Closes #53

Background

This PR adds an assign_clades method to the CladeTime class. The new method uses CladeTime's sequence_as_of and tree_as_of dates to invoke Nextclade's assignment process using the correct tree/dataset.

The PR is large, mostly to expedite this feature, which is needed for the variant nowcast hub. That said, it can be reviewed commit-by-commit, which should help.

assign_clades returns an object that contains three attributes:

meta: information about the clade assignment process
detail: a Polars LazyFrame that contains line-by-line clade assignments
summary: a Polars LazyFrame that summarizes clade counts by location, clade, and date

Once this PR is merged, cladetime will have enough functionality to generate target metadata for the variant nowcast hub.

Testing

This PR prioritizes clade assignments that will be run on a scheduled basis (i.e., with the assumption that we'll have sufficient compute and memory resources). Even on a powerful laptop, it's not a good idea to assign more than about a month's worth of sequence collection dates, as the clade assignment process is resource-intensive.

Regardless of the collection dates used, assign_clades still needs to download Nextstrain's entire sequence .fasta file before filtering it, so the code below may be slow, depending on your connection speed. Tackling #55 should speed up the download somewhat.

To run the end-to-end clade assignment using an older reference tree against current sequences (this is a nonsensical example):

from cladetime import CladeTime, sequence

ct = CladeTime(tree_as_of="2024-08-02")
filtered_metadata = sequence.filter_metadata(
    ct.sequence_metadata,
    collection_min_date="2024-07-01", 
    collection_max_date="2024-07-01"
)

assignments = ct.assign_clades(filtered_metadata)
assignments.meta
# {'sequence_as_of': datetime.datetime(2024, 11, 12, 18, 9, 19, tzinfo=datetime.timezone.utc),
#  'tree_as_of': datetime.datetime(2024, 8, 2, 0, 0, tzinfo=datetime.timezone.utc),
#  'nextclade_dataset_version': '2024-07-17--12-57-03Z',
#  'nextclade_dataset_name': 'SARS-CoV-2',
#  'nextclade_version_num': '3.8.2',
#  'assignment_as_of': '2024-11-12 18:19'}

assignments.summary.collect()
# shape: (94, 6)
# ┌──────────┬────────────┬──────────────┬──────────────────┬─────────┬───────┐
# │ location ┆ date       ┆ host         ┆ clade_nextstrain ┆ country ┆ count │
# │ ---      ┆ ---        ┆ ---          ┆ ---              ┆ ---     ┆ ---   │
# │ str      ┆ date       ┆ str          ┆ str              ┆ str     ┆ u32   │
# ╞══════════╪════════════╪══════════════╪══════════════════╪═════════╪═══════╡
# │ NM       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24C              ┆ USA     ┆ 4     │
# │ DE       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24C              ┆ USA     ┆ 4     │
# │ VA       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24B              ┆ USA     ┆ 2     │
# │ SD       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24B              ┆ USA     ┆ 2     │
# │ MO       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24B              ┆ USA     ┆ 1     │
# │ …        ┆ …          ┆ …            ┆ …                ┆ …       ┆ …     │
# │ MN       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24C              ┆ USA     ┆ 6     │
# │ MD       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24B              ┆ USA     ┆ 1     │
# │ MD       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24A              ┆ USA     ┆ 2     │
# │ MI       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24A              ┆ USA     ┆ 3     │
# │ MO       ┆ 2024-07-01 ┆ Homo sapiens ┆ 24C              ┆ USA     ┆ 1     │
# └──────────┴────────────┴──────────────┴──────────────────┴─────────┴───────┘

assignments.detail.collect().count()
# shape: (1, 93)
# ┌─────────┬──────┬────────┬──────┬───┬──────────────────┬─────────────┬──────────┬────────┐
# │ country ┆ date ┆ strain ┆ host ┆ … ┆ pcrPrimerChanges ┆ failedCdses ┆ warnings ┆ errors │
# │ ---     ┆ ---  ┆ ---    ┆ ---  ┆   ┆ ---              ┆ ---         ┆ ---      ┆ ---    │
# │ u32     ┆ u32  ┆ u32    ┆ u32  ┆   ┆ u32              ┆ u32         ┆ u32      ┆ u32    │
# ╞═════════╪══════╪════════╪══════╪═══╪══════════════════╪═════════════╪══════════╪════════╡
# │ 518     ┆ 518  ┆ 518    ┆ 518  ┆ … ┆ 0                ┆ 0           ┆ 0        ┆ 0      │
# └─────────┴──────┴────────┴──────┴───┴──────────────────┴─────────────┴──────────┴────────┘

assignments.detail.select(["strain", "clade_nextstrain", "date", "location"]).collect()
# shape: (518, 4)
# ┌───────────────────────────────┬──────────────────┬────────────┬──────────┐
# │ strain                        ┆ clade_nextstrain ┆ date       ┆ location │
# │ ---                           ┆ ---              ┆ ---        ┆ ---      │
# │ str                           ┆ str              ┆ date       ┆ str      │
# ╞═══════════════════════════════╪══════════════════╪════════════╪══════════╡
# │ USA/2024CV1154/2024           ┆ 24A              ┆ 2024-07-01 ┆ AZ       │
# │ USA/2024CV1156/2024           ┆ 24C              ┆ 2024-07-01 ┆ AZ       │
# │ USA/2024CV1161/2024           ┆ 24B              ┆ 2024-07-01 ┆ AZ       │
# │ USA/2024CV1162/2024           ┆ 24C              ┆ 2024-07-01 ┆ AZ       │
# │ USA/2024CV1164/2024           ┆ 24C              ┆ 2024-07-01 ┆ AZ       │
# │ …                             ┆ …                ┆ …          ┆ …        │
# │ USA/WV-CDC-LC1108206/2024     ┆ 24A              ┆ 2024-07-01 ┆ WV       │
# │ humans/USA/WA-PHL-035342/2024 ┆ 24B              ┆ 2024-07-01 ┆ WA       │
# │ humans/USA/WA-PHL-035344/2024 ┆ 24B              ┆ 2024-07-01 ┆ WA       │
# │ humans/USA/WA-PHL-035345/2024 ┆ 24B              ┆ 2024-07-01 ┆ WA       │
# │ humans/USA/WA-PHL-035348/2024 ┆ 24C              ┆ 2024-07-01 ┆ WA       │
# └───────────────────────────────┴──────────────────┴────────────┴──────────┘

Known usability improvements (for addressing later)

Tidy up cladetime logs and warning formats #56
Optimize .fasta sequence IO #55
Cladetime doesn't yet clean up after itself reliably: the Nextclade CLI writes its output to disk, and cladetime returns that data as a Polars frame without removing the Nextclade output files. See also: Simplify path/file handling and Docker mounts when assigning clades #52

Since it's possible to mix and match sequence_as_of and tree_as_of dates in cladetime, sequences and reference trees may have different ncov_metadata attributes (dataset version and Nextclade CLI version, for example). Add an ncov_metadata property to Tree that reflects metadata for the tree_as_of date (as opposed to CladeTime's ncov_metadata property, which reflects sequence_as_of). We'll use this new property to make sure we're using the correct Nextclade dataset when assigning clades.

Still in the NCBI mindset, earlier versions of sequence.filter used accession numbers to compare .fasta records to a set of sequence "ids". However, for the processed Nextstrain sequences, we need to use the "strain" column.
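To make the strain-based matching concrete, here is a minimal standard-library sketch. It is not cladetime's actual implementation: the helper name filter_fasta_by_strain is hypothetical, the second record name is made up, and the sketch assumes the strain is the first whitespace-delimited token after ">" in each FASTA header.

```python
from typing import Iterable, Iterator


def filter_fasta_by_strain(fasta_lines: Iterable[str], strains: set[str]) -> Iterator[str]:
    """Yield only the FASTA records whose header matches a wanted strain.

    Hypothetical helper: assumes the record id is the first token after '>'.
    """
    keep = False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # New record: decide whether to keep it based on its strain id.
            header = line[1:].split()
            keep = bool(header) and header[0] in strains
        if keep and line:
            yield line


records = [
    ">USA/2024CV1154/2024",
    "ACGT",
    "TTGA",
    ">USA/OTHER/2024",  # made-up record that should be dropped
    "GGCC",
]
filtered = list(filter_fasta_by_strain(records, {"USA/2024CV1154/2024"}))
```

Streaming line by line avoids holding the whole .fasta file in memory, which matters given the size of the full Nextstrain sequence download.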

We will need to instantiate a Tree object from CladeTime when assigning clades to sequences. Thus, Tree shouldn't be instantiated from a CladeTime object, because that creates a circular dependency.

Adding collection_min_date and collection_max_date parameters to filter_metadata allows additional filtering on sequence metadata by minimum and maximum collection date. This supports clade assignments, where we'll only want to assign clades to a small subset of sequences based on their collection date. Behavior is unchanged if these new parameters aren't specified.
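The optional-bounds behavior can be sketched with plain Python rows standing in for the Polars LazyFrame. The function name filter_by_collection_date is hypothetical, and treating both bounds as inclusive is an assumption (it is consistent with the example in this PR, where min and max are both 2024-07-01).

```python
from datetime import date
from typing import Optional


def filter_by_collection_date(
    rows: list[dict],
    collection_min_date: Optional[date] = None,
    collection_max_date: Optional[date] = None,
) -> list[dict]:
    """Keep rows whose 'date' falls within the optional bounds (assumed inclusive)."""
    kept = []
    for row in rows:
        if collection_min_date is not None and row["date"] < collection_min_date:
            continue
        if collection_max_date is not None and row["date"] > collection_max_date:
            continue
        kept.append(row)
    return kept


rows = [
    {"strain": "A", "date": date(2024, 6, 30)},
    {"strain": "B", "date": date(2024, 7, 1)},
    {"strain": "C", "date": date(2024, 7, 2)},
]
one_day = filter_by_collection_date(rows, date(2024, 7, 1), date(2024, 7, 1))
unfiltered = filter_by_collection_date(rows)  # no bounds given: nothing is dropped
```

Defaulting both bounds to None is what keeps existing callers unaffected.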

Moving the date validation function out of cladetime.py will allow re-use of that function when working with collection begin/end dates in sequence assignment. Additional test cases for date commit.

This new method is how cladetime users (including people using the upcoming CLI) will do custom clade assignments. After validating dates, assign_clades calls out to existing functions, performing a kind of "mini pipeline" to return a LazyFrame with the results from Nextclade merged with metadata from the sequences being assigned.

This changeset represents new tests for the assign_clades method, as well as updates that reflect some refactoring that occurred along the way.

This changeset returns a summarized version of the clade assignments as well as some metadata about the clade assignment process.

bsweger · 2024-11-12T19:19:02Z

src/cladetime/tree.py

-        self._nextclade_data_url = self._clade_time._config.nextclade_data_url
-        self._nextclade_data_url_version = self._clade_time._config.nextclade_data_url_version
-        self._tree_name = self._clade_time._config.nextclade_input_tree_name
+        self._config = self._clade_time._config


Note that future commits remove CladeTime from Tree instantiation to avoid circular dependencies, so all of the awkward references to _clade_time._config and the like will disappear.

bsweger · 2024-11-12T19:22:18Z

src/cladetime/tree.py

@@ -125,14 +148,6 @@ def _get_tree_url(self):
        )
        return tree_url

-    def _get_url_ncov_metadata(self):


No longer needed because we're setting the Tree ncov metadata information the same way we set the CladeTime ncov metadata property.

bsweger · 2024-11-12T19:28:37Z

src/cladetime/util/reference.py

@@ -71,7 +71,7 @@ def _get_s3_object_url(bucket_name: str, object_key: str, date: datetime) -> Tup


 def _run_nextclade_cli(
-    nextclade_cli_version: str, nextclade_command: list[str], output_file: Path, input_files: list[Path] | None = None
+    nextclade_cli_version: str, nextclade_command: list[str], output_path: Path, input_files: list[Path] | None = None


Chipping away at #52. The docker run command only needs to know about the output_path (for volume mounting). The actual file name for Nextclade CLI output is specified in nextclade_command.
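A rough sketch of that split of responsibilities follows. It is not the body of _run_nextclade_cli: the function build_docker_command, the container directory /data, the image tag, and the example Nextclade arguments are all placeholders chosen for illustration.

```python
from pathlib import Path


def build_docker_command(
    image: str,
    nextclade_command: list[str],
    output_path: Path,
    container_dir: str = "/data",  # assumed mount point inside the container
) -> list[str]:
    """Assemble a `docker run` argument list that mounts output_path as a volume."""
    return [
        "docker", "run", "--rm",
        "-v", f"{output_path}:{container_dir}",  # only the directory is mounted
        image,
        *nextclade_command,  # the output file name lives in here, not in the mount
    ]


cmd = build_docker_command(
    image="example/nextclade:3.8.2",  # placeholder image tag, not taken from the PR
    nextclade_command=["nextclade", "run", "--output-csv", "/data/assignments.csv"],
    output_path=Path("/tmp/cladetime"),
)
```

The resulting list could be handed to subprocess.run; keeping the file name inside nextclade_command means the mount logic never needs to change when the output format does.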

bsweger · 2024-11-12T19:34:00Z

src/cladetime/cladetime.py

+        assigned_clades = sequence_metadata.join(
+            assigned_clades.lazy(), left_on="strain", right_on="seqName", how="left"
+        )
+        return assigned_clades


Future commits change the return value to an object that includes summarized clade counts as well as the line file.
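For readers less familiar with join semantics, the left join in the diff above can be approximated with plain dictionaries. This is only an illustration of the behavior, not the Polars code path; the helper left_join is hypothetical and assumes seqName values are unique in the Nextclade output.

```python
def left_join(metadata_rows, assignment_rows, left_on="strain", right_on="seqName"):
    """Keep every metadata row; attach assignment columns when a match exists."""
    by_key = {row[right_on]: row for row in assignment_rows}
    # Columns contributed by the right side, excluding the join key itself.
    right_columns = sorted({k for row in assignment_rows for k in row if k != right_on})
    joined = []
    for meta in metadata_rows:
        match = by_key.get(meta[left_on], {})
        # Unmatched rows get None, mirroring the nulls a left join produces.
        joined.append({**meta, **{col: match.get(col) for col in right_columns}})
    return joined


metadata = [
    {"strain": "USA/2024CV1154/2024", "location": "AZ"},
    {"strain": "USA/2024CV1156/2024", "location": "AZ"},
]
assignments = [{"seqName": "USA/2024CV1154/2024", "clade_nextstrain": "24A"}]
joined = left_join(metadata, assignments)
```

Because the metadata is on the left, a sequence that Nextclade failed to assign still appears in the result, just with an empty clade.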

bsweger · 2024-11-12T19:36:33Z

src/cladetime/sequence.py

@@ -258,7 +260,12 @@ def filter_metadata(


 def get_clade_counts(filtered_metadata: pl.LazyFrame) -> pl.LazyFrame:


Left this here because the variant-nowcast-hub scripts still reference it. Its replacement (summarize clades) does the same thing but:

has a better name

allows a configurable list of group_by columns
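The idea of a configurable group_by list can be sketched with collections.Counter. The function name summarize_counts, its default columns, and the sample rows are assumptions for illustration, not the signature of the function added in this PR.

```python
from collections import Counter


def summarize_counts(rows, group_by=("location", "date", "clade_nextstrain")):
    """Count rows per unique combination of the group_by columns."""
    counts = Counter(tuple(row[col] for col in group_by) for row in rows)
    # Rebuild one dict per group, adding the count as an extra column.
    return [dict(zip(group_by, key), count=n) for key, n in counts.items()]


rows = [
    {"location": "AZ", "date": "2024-07-01", "clade_nextstrain": "24C"},
    {"location": "AZ", "date": "2024-07-01", "clade_nextstrain": "24C"},
    {"location": "WA", "date": "2024-07-01", "clade_nextstrain": "24B"},
]
by_default = summarize_counts(rows)
by_clade = summarize_counts(rows, group_by=("clade_nextstrain",))
```

Passing a different group_by tuple changes the granularity of the summary without touching the counting logic.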

matthewcornell

approved per pair review

bsweger added 13 commits November 6, 2024 16:02

Use "strain" as the id for filtering sequences

d516677

Still in the NCBI mindset, earlier versions of sequence.filter used accession numbers to compare .fasta records to a set of sequence "ids". However, for the processed Nextstrain sequences, we need to use the "strain" column.

Make the integration test_file_path fixture shareable

477e01e

Simplify path handling when interacting with docker

b03f01e

Fix circular import / change the signature of Tree

049ec1b

We will need to instantiate a Tree object from CladeTime when assigning clades to sequences. Thus, Tree shouldn't be instantiated from a CladeTime object, because that creates a circular dependency.

Move date validation function out of cladetime.py

38c384d

This will allow re-use of that function when working with collection begin/end dates in sequence assignment. Additional test cases for date commit.

clean up unused config fields

ed4dc43

Add tests for the new CladeTime assign_clades method

c148db3

This changeset represents new tests for the assign_clades method, as well as updates that reflect some refactoring that occurred along the way.

Update the return value of assign_clades

358f759

This changeset returns a summarized version of the clade assignments as well as some metadata about the clade assignment process.

Run integration tests more frequently

7c2aa5e

Fix readthedocs build error

c9c09d1

bsweger force-pushed the bsweger/assign-clades-method/53 branch from 1fc1354 to c9c09d1 Compare November 12, 2024 19:15

bsweger requested review from elray1 and matthewcornell November 12, 2024 20:17

matthewcornell approved these changes Nov 12, 2024

bsweger merged commit 7e17fa2 into main Nov 13, 2024
2 checks passed

bsweger deleted the bsweger/assign-clades-method/53 branch November 13, 2024 13:49
