Bsweger/metadata download as of #23

bsweger · 2024-09-26T20:56:13Z

Resolves #12

This PR modifies the download_covid_genome_metadata function in utils/sequence.py to accept an as_of parameter.

The "manual" usage is a little awkward because we're pulling the Nextstrain S3 information from a Config file that desperately needs a refactor (see individual commit messages).

Example usage:

from datetime import datetime

from virus_clade_utils.util.config import Config
from virus_clade_utils.util.sequence import download_covid_genome_metadata

# sequence_date and reference_tree_as_of don't actually do anything here
# but are required for creating a Config instance (this is what needs a refactor)
# (the reason we're instantiating a config object is to get config.data_path)
sequence_date = reference_tree_as_of = datetime.now()
config = Config(sequence_date, reference_tree_as_of)

bucket = Config.nextstrain_ncov_bucket
key = Config.nextstrain_genome_metadata_key

metadata = download_covid_genome_metadata(
    bucket,
    key,
    config.data_path,
    # download the S3 file as of the date below (YYYY-MM-DD)
    # if you want the latest file, don't pass as_of
    as_of="2024-09-24",
    # if the file for above date already exists, don't re-download
    use_existing=True,
)

The config object was originally designed to be used with the script that assigns sequences to clades. We still have some work to do to make the Config less specific to that original usage, but to start with, let's make the data directory optional and set it to a default value.

This changeset allows Nextstrain genome sequence metadata downloads to get the S3 file version with the most recent modified date that is less than the optional "as_of" date parameter. If no "as_of" date is supplied, download the most recent version of the file.
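A minimal sketch of the selection rule described above, not the code in this PR. It assumes each version is a dict shaped like an entry in the "Versions" list that boto3's list_object_versions returns, with a timezone-aware "LastModified" datetime. The function name select_version_as_of is hypothetical, and the inclusive "on or before" comparison is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def select_version_as_of(versions: list[dict], as_of: Optional[datetime] = None) -> dict:
    """Return the newest version whose LastModified is on or before as_of.

    Hypothetical helper for illustration; `versions` mimics S3 version
    metadata with "VersionId" and "LastModified" keys.
    """
    if as_of is None:
        # No as_of supplied: every existing version qualifies, so the
        # most recent one is selected.
        as_of = datetime.now(timezone.utc)

    selected = None
    for version in versions:
        modified = version["LastModified"]
        if modified > as_of:
            # Version was written after the requested point in time.
            continue
        if selected is None or modified > selected["LastModified"]:
            selected = version

    if selected is None:
        raise ValueError(f"No version found on or before {as_of.isoformat()}")
    return selected
```

Because the comparison uses full datetimes, two versions written on the same calendar day are ordered by time, and the later one is chosen whenever both qualify.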

For clearer unit testing

bsweger · 2024-09-27T13:30:45Z

src/virus_clade_utils/get_clade_list.py

@@ -68,7 +69,8 @@ def get_clades(clade_counts: pl.LazyFrame, threshold: float, threshold_weeks: in

 # FIXME: provide ability to instantiate Config for the get_clade_list function and get the data_path from there
 def main(
-    genome_metadata_path: AnyPath = Config.nextstrain_latest_genome_metadata,


Rather than download the metadata file from the URL listed on Nextstrain's website, we'll download the file via an S3 https link.

bsweger · 2024-09-27T13:32:01Z

src/virus_clade_utils/util/reference.py

reference.py is no longer a sensible name for this file, but I chose not to pull that thread as part of this update

elray1

This is all great! I asked questions in scattered places that I'll sum up here:

1. Suppose it's 3am eastern time/7am UTC and we go to get the latest available data from nextclade. Their latest data update job completed at ~1 or 2 am UTC. I provide no argument or the current date for as_of, and I think this gets translated to 12:00:00 am UTC, which is prior to today's data update job, and so I end up pulling yesterday's data. Instead, should we translate a provided as_of date to 23:59:59 to get the last data that was available that day?
2. I think this is less critical for us, but... suppose Nextstrain runs two data updates on the same date. Do we get the last of them? I think this has to do with the comparison `version_date > selected_version["LastModified"]`, and we will get the last one if the last modified field is a datetime.

Additionally, I noticed that the automated unit test runs failed, I didn't dig into why.

elray1 · 2024-09-27T14:10:49Z

src/virus_clade_utils/util/reference.py

+    For a versioned, public S3 bucket and object key, return the version ID
+    of the object as it existed at a specific date (UTC)


If multiple versions are stored on the same date, will this be the first or last of those?

elray1 · 2024-09-27T14:32:49Z

src/virus_clade_utils/util/reference.py

+        raise e
+
+    if selected_version is None:
+        raise ValueError(f"No version of {object_key} found before {date}")


Suggested change

raise ValueError(f"No version of {object_key} found before {date}")

raise ValueError(f"No version of {object_key} found on or before {date}")

actually, i'm not sure about this

elray1 · 2024-09-27T14:41:50Z

src/virus_clade_utils/util/sequence.py

+        as_of_datetime = datetime.strptime(as_of, "%Y-%m-%d").replace(tzinfo=timezone.utc)
+
+    (s3_version, s3_url) = get_s3_object_url(bucket, key, as_of_datetime)
+    filename = data_path / f"{as_of_datetime.date().strftime("%Y-%m-%d")}-{Path(key).name}"


Could we just use the provided as_of string here?

Suggested change

filename = data_path / f"{as_of_datetime.date().strftime("%Y-%m-%d")}-{Path(key).name}"

filename = data_path / f"{as_of}-{Path(key).name}"

Good suggestion--I'll hold off, given that we're planning to rejig things and start passing timestamps around!

elray1 · 2024-09-27T14:45:36Z

tests/conftest.py

+@pytest.fixture
+def s3_setup():
+    """Setup mock S3 bucket with versioned objects."""
+    with mock_aws():


honestly, this is pretty dazzling!! ✨

Thanks--I fixed up the dependencies so the tests, you know, actually run!

elray1 · 2024-09-27T14:48:25Z

src/virus_clade_utils/util/sequence.py

-    session = get_session()
-    filename = data_path / Path(url).name
+    if as_of is None:
+        as_of_datetime = datetime.now().replace(tzinfo=timezone.utc)


I wonder if we need to add 1 day (or 23 hours, 59 minutes, 59 seconds?) here to ensure that we get the latest available as of right now. I think this returns midnight (first second of the day) of today, which may be prior to a Nextstrain data run that happened at e.g. 2am UTC?
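For reference, a small sketch of what the quoted line evaluates to. datetime.now() returns the machine's local wall-clock time with no timezone attached, so calling .replace(tzinfo=timezone.utc) on it labels that local time as UTC rather than returning midnight. The helper names below are hypothetical and exist only to show the difference from datetime.now(timezone.utc).

```python
from datetime import datetime, timedelta, timezone


def utc_now() -> datetime:
    """Current instant as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


def mislabelled_now() -> datetime:
    """Local wall-clock time with a UTC label attached (the quoted pattern)."""
    return datetime.now().replace(tzinfo=timezone.utc)


def local_utc_offset() -> timedelta:
    """The machine's current offset from UTC, e.g. -4 hours for US Eastern in summer."""
    return datetime.now().astimezone().utcoffset()


if __name__ == "__main__":
    # On a machine whose clock is not set to UTC, these two values differ
    # by the local UTC offset; on a UTC machine they match.
    print("aware UTC now:  ", utc_now().isoformat())
    print("mislabelled now:", mislabelled_now().isoformat())
    print("local offset:   ", local_utc_offset())
```

On a host running in UTC (common for CI runners) the two calls agree, which can hide the difference in automated tests.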

elray1 · 2024-09-27T14:54:11Z

src/virus_clade_utils/util/reference.py

+        selected_version = None
+        for page in page_iterator:
+            for version in page.get("Versions", []):
+                version_date = version["LastModified"]


related to question above -- is version_date a date or a datetime?

elray1 · 2024-09-27T14:56:11Z

src/virus_clade_utils/util/sequence.py

+    if as_of is None:
+        as_of_datetime = datetime.now().replace(tzinfo=timezone.utc)
+    else:
+        as_of_datetime = datetime.strptime(as_of, "%Y-%m-%d").replace(tzinfo=timezone.utc)


similar -- do we want midnight of that day, or just before midnight of the next day?
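One possible way to express the "just before midnight" option, as a sketch only: parse the YYYY-MM-DD string and pin it to 23:59:59 UTC, so that any version written during that calendar day still qualifies. The function name as_of_to_end_of_day is hypothetical and is not part of this PR.

```python
from datetime import datetime, time, timezone


def as_of_to_end_of_day(as_of: str) -> datetime:
    """Convert a YYYY-MM-DD string to the last second of that day in UTC.

    Hypothetical helper for illustration.
    """
    day = datetime.strptime(as_of, "%Y-%m-%d").date()
    return datetime.combine(day, time(23, 59, 59), tzinfo=timezone.utc)


if __name__ == "__main__":
    # Midnight interpretation (the behavior shown in the diff above).
    start_of_day = datetime.strptime("2024-09-24", "%Y-%m-%d").replace(tzinfo=timezone.utc)
    # End-of-day interpretation.
    end_of_day = as_of_to_end_of_day("2024-09-24")
    # A data run at 02:00 UTC that day falls between the two.
    data_run = datetime(2024, 9, 24, 2, 0, tzinfo=timezone.utc)
    print(start_of_day < data_run <= end_of_day)
```

With the midnight interpretation, a run at 02:00 UTC on the as_of date is excluded; with the end-of-day interpretation it is included.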

bsweger · 2024-09-27T15:31:48Z

This is all great! I asked questions in scattered places that I'll sum up here:
1. Suppose it's 3am eastern time/7am UTC and we go to get the latest available data from nextclade [snip]

I was being lazy so you could use the feature, but yes, we should be operating with datetime instead of date. I'll work on that.

2. I think this is less critical for us, but... suppose Nextstrain runs two data updates on the same date. Do we get the last of them?  I think this has to do with the comparison `version_date > selected_version["LastModified"]`, and we will get the last one if the last modified field is a datetime.

Yes, we will pull the most recent modified, which I think is what we want (let me know if you disagree). I'll update the tests to make this case more explicit.

Additionally, I noticed that the automated unit test runs failed, I didn't dig into why.

Silly oversight on my part--fixed!

elray1 · 2024-09-27T15:40:17Z

Thanks! I like the ability to use the feature.
Agreed, this is what we want.

elray1

approved

bsweger · 2024-09-27T16:06:00Z

Thanks @elray1! As discussed, here's the follow-up to ensure we address the timestamp concerns: #24

bsweger added 3 commits September 25, 2024 12:13

Small refactor to pass session to download_covid_genome_metadata

4b0abcd

For clearer unit testing

bsweger requested a review from elray1 September 26, 2024 20:57

bsweger added 2 commits September 27, 2024 09:24

Add tests for new as_of functionality

a1fe96f

remove test file committed by mistake

5c99db0

bsweger commented Sep 27, 2024

elray1 reviewed Sep 27, 2024

Add overlooked python-mock dependency

e7c7359

elray1 approved these changes Sep 27, 2024

bsweger mentioned this pull request Sep 27, 2024

Incorporate timestamp when using an "as_of" date to get the VersionID of an S3 object #24

Open

2 tasks

bsweger merged commit 0b87882 into main Sep 27, 2024
1 check passed

bsweger deleted the bsweger/metadata-download-as-of branch September 27, 2024 16:06

bsweger mentioned this pull request Sep 27, 2024

Add S3 VersionId parameter to virus-clade-utils function that downloads sequence metadata from Nexstrain #22

Closed
