Bsweger/get clades list #18

bsweger · 2024-09-09T13:25:44Z

Resolves #13
Also see comments here about why I deviated from the original plan of putting this function in the hub itself.

This PR continues the work to port @rogersbw's "clades to model" process into virus-clade-utils:

- Adds one more supporting function to sequence.py
- Adds get_clade_list.py, which returns a list of clades to model, based on 3 criteria:
  - threshold (proportion)
  - threshold weeks
  - max number of clades for the resulting list
- Adds test cases for above (these will require careful review)
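
To make the three criteria concrete, here is a minimal stdlib-only sketch of how they might interact. It is not the code in get_clade_list.py: the function name `clades_to_model`, the `(collection_date, clade)` record shape, the ISO-week grouping, the strict `>` comparison, and ranking by peak weekly proportion are all assumptions made for illustration. Only the `threshold` default of 0.01 comes from this PR.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta


def clades_to_model(records, *, threshold_weeks, max_clades, threshold=0.01, as_of=None):
    """Hypothetical sketch: pick clades whose share of sequences exceeds
    `threshold` in at least one of the last `threshold_weeks` weeks.

    records: iterable of (collection_date, clade) tuples, one per sequence.
    """
    records = list(records)
    if not records:
        return []
    as_of = as_of or max(collected for collected, _ in records)
    cutoff = as_of - timedelta(weeks=threshold_weeks)

    # Count sequences per clade within each ISO week of the lookback window.
    weekly = defaultdict(Counter)
    for collected, clade in records:
        if cutoff < collected <= as_of:
            iso = collected.isocalendar()
            weekly[(iso[0], iso[1])][clade] += 1

    # Track each clade's highest weekly proportion.
    peak = {}
    for counts in weekly.values():
        total = sum(counts.values())
        for clade, n in counts.items():
            peak[clade] = max(peak.get(clade, 0.0), n / total)

    # Apply the proportion threshold, rank deterministically (highest peak
    # first, then clade name), and cap the list at max_clades.
    keep = [clade for clade, share in peak.items() if share > threshold]
    keep.sort(key=lambda clade: (-peak[clade], clade))
    return sorted(keep[:max_clades])
```

Sorting before slicing keeps the result stable when several clades tie at the cutoff, which is the same concern behind the sort-then-slice change mentioned in this description.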

Note that get_clade_list.py returns "clades to model" as a Python list, with the assumption that it will be serialized and written to disk by the calling process (e.g., the variant nowcast hub). We'll need to think about that more, since that hub's "make round config" code is written in R.
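
One possible way for a calling process to hand the list to R is to write it as JSON, which R can read with a package such as jsonlite. The sketch below is only an illustration of that idea: the helper names `serialize_clade_list` and `write_clade_list` and the `{"clades": [...]}` layout are assumptions, not something this PR defines.

```python
import json
from pathlib import Path


def serialize_clade_list(clades):
    """Hypothetical helper: render the clade list as a JSON document."""
    # Sort so that repeated runs with the same clades produce identical files.
    return json.dumps({"clades": sorted(clades)}, indent=2) + "\n"


def write_clade_list(clades, out_path):
    """Hypothetical helper: write the serialized list to disk for another process."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(serialize_clade_list(clades), encoding="utf-8")
    return out_path
```

A plain JSON file keeps the Python and R sides decoupled: neither needs to import the other's code, only to agree on the file layout.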

There are a few changes from the original code (for example, sorting by date and clade before slicing to the "n" clades that make up the final list). This changeset also adds test cases for various permutations of threshold, threshold_weeks, and the maximum number of clades allowed in the list being returned.

src/virus_clade_utils/get_clade_list.py

bsweger · 2024-09-09T13:38:11Z

src/virus_clade_utils/get_clade_list.py

+# FIXME: provide ability to instantiate Config for the get_clade_list function and get the data_path from there
+def main(
+    genome_metadata_path: AnyPath = Config.nextstrain_latest_genome_metadata,
+    data_dir: AnyPath = AnyPath(".").home() / "covid_variant",


As with @rogersbw's original code, we're saving the Nextstrain metadata to disk so we can work with it locally. I'm assuming that this is an internal concern (for example, we don't need to surface the download to whatever process runs get_clade_list, and we could change this implementation detail if needed).

Please let me know if that assumption is incorrect!
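
For readers unfamiliar with the pattern, keeping the download internal could look roughly like the sketch below: the caller only ever sees a local path, and an existing file is reused unless a refresh is forced. The name `ensure_local_copy` and its parameters are hypothetical and are not the signature used in this PR.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_local_copy(url, data_dir, filename, fetch=urlretrieve, force=False):
    """Hypothetical helper: return a local path for `url`, downloading only when needed.

    `fetch` is any callable taking (url, target_path); it defaults to the
    standard library's urlretrieve and can be swapped out in tests.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / filename
    # Reuse a previously saved file unless the caller asks for a fresh copy.
    if force or not target.exists():
        fetch(url, target)
    return target
```

Because the download sits behind a single function, it could later be replaced (for example, by reading the metadata directly from cloud storage) without changing the callers.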

bsweger · 2024-09-09T13:39:42Z

src/virus_clade_utils/get_clade_list.py

+def main(
+    genome_metadata_path: AnyPath = Config.nextstrain_latest_genome_metadata,
+    data_dir: AnyPath = AnyPath(".").home() / "covid_variant",
+    threshold: float = 0.01,


These defaults are from the original clades_to_model code: https://github.com/rogersbw/clade_data_utils/blob/main/utility/data_utility.py#L79

src/virus_clade_utils/util/sequence.py

tests/unit/test_get_clade_list.py

rogersbw · 2024-09-09T21:00:48Z

This all looks good to me. Merging!

bsweger added 3 commits September 6, 2024 14:45

add option to skip metadata download and use an existing file

8e6d803

add clade count function from clade_data_utils

9268d25