Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bsweger/get clades list #18

Merged
merged 3 commits into from
Sep 9, 2024
Merged

Bsweger/get clades list #18

merged 3 commits into from
Sep 9, 2024

Conversation

bsweger
Copy link
Collaborator

@bsweger bsweger commented Sep 9, 2024

Resolves #13
Also see comments here about why I deviated from the original plan of putting this function in the hub itself.

This PR continues the work to port @rogersbw's "clades to model" process into virus-clade-utils:

  • Adds one more supporting function to sequence.py
  • Adds get_clade_list.py, which returns a list of clades to model, based on 3 criteria:
    • threshold (proportion)
    • threshold weeks
    • max number of clades for the resulting list
  • Adds test cases for above (these will require careful review)

Note that get_clade_list.py returns "clades to model" as a Python list, with the assumption that it will be serialized and written to disk by the calling process (e.g., the variant nowcast hub). We'll need to think about that more, since that hub's "make round config" code is written in R.

There are a few changes (for example, sorting by date and clade
before slicing on the "n" clades that create the final list).
This changeset also adds test cases for various permutations
of threshold, threshold_weeks, and the maximum number of clades
allowed in the list being returned.
# FIXME: provide ability to instantiate Config for the get_clade_list function and get the data_path from there
def main(
genome_metadata_path: AnyPath = Config.nextstrain_latest_genome_metadata,
data_dir: AnyPath = AnyPath(".").home() / "covid_variant",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with @rogersbw original code, we're saving the Nextstrain metadata to disk so we can work with it locally. I'm assuming that this is an internal concern (for example, we don't need to surface the download to whatever process runs get_clade_list and could change this implementation detail if needed).

Please let me know if that assumption is incorrect!

def main(
genome_metadata_path: AnyPath = Config.nextstrain_latest_genome_metadata,
data_dir: AnyPath = AnyPath(".").home() / "covid_variant",
threshold: float = 0.01,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These defaults are from the original clades_to_model code: https://github.com/rogersbw/clade_data_utils/blob/main/utility/data_utility.py#L79

@rogersbw
Copy link
Collaborator

rogersbw commented Sep 9, 2024

This all looks good to me. Merging!

@rogersbw rogersbw merged commit 4e1da08 into main Sep 9, 2024
1 check passed
@bsweger bsweger deleted the bsweger/get_clades_list branch September 10, 2024 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move data_utils functions to virus-clade-utils
2 participants