Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bsweger/metadata download as of #23

Merged
merged 6 commits into from
Sep 27, 2024
Merged

Conversation

bsweger
Copy link
Collaborator

@bsweger bsweger commented Sep 26, 2024

Resolves #12

This PR modifies the download_covid_genome_metadata function in utils/sequence.py to accept an as_of parameter.

The "manual" usage is a little awkward because we're pulling the Nextstrain S3 information from a Config file that desperately needs a refactor (see individual commit messages).

Example usage:

from datetime import datetime

from virus_clade_utils.util.config import Config
from virus_clade_utils.util.sequence import download_covid_genome_metadata

# sequence_date and reference_tree_as_of don't actually do anything here
# but are required for creating a Config instance (this is what needs a refactor)
# (the reason we're instantiating a config object is to get config.date_path)
sequence_date = reference_tree_as_of = datetime.now()
config = Config(sequence_date, reference_tree_as_of)

bucket = Config.nextstrain_ncov_bucket
key = Config.nextstrain_genome_metadata_key

metadata = download_covid_genome_metadata(
    bucket,
    key,
    config.data_path,
    # download the S3 file as of the date below (YYYY-MM-DD)
    # if you want the latest file, don't pass as_of
    as_of="2024-09-24",
    # if the file for above date already exists, don't re-download
    use_existing=True,
)

The config object was originally designed to be used with the
script that assigns sequences to clades. We still have some
work to do to make the Config less specific to that original
usage, but to start with, let's make the data directory
optional and set it to a default value.
This changeset allows Nextstrain genome sequence metadata
downloads to get the S3 file version with the most recent
modified date that is less than the optional "as_of" date
parameter. If no "as_of" date is supplied, download the
most recent version of the file.
@bsweger bsweger requested a review from elray1 September 26, 2024 20:57
@@ -68,7 +69,8 @@ def get_clades(clade_counts: pl.LazyFrame, threshold: float, threshold_weeks: in

# FIXME: provide ability to instantiate Config for the get_clade_list function and get the data_path from there
def main(
genome_metadata_path: AnyPath = Config.nextstrain_latest_genome_metadata,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than download the metadata file from the URL listed on Nextstrain's website, we'll download the file via an S3 https link.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reference.py is no longer a sensible name for this file, but I chose not to pull that thread as part of this update

Copy link
Collaborator

@elray1 elray1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all great! I asked questions in scattered places that I'll sum up here:

  1. Suppose it's 3am eastern time/7am UTC and we go to get the latest available data from nextclade. Their latest data update job completed at ~1 or 2 am UTC. I provide no argument or the current date for as_of, and I think this gets translated to 12:00:00 am UTC, which is prior to today's data update job, and so I end up pulling yesterday's data. Instead, should we translate a provided as_of date to 23:59:59 to get the last data that was available that day?
  2. I think this is less critical for us, but... suppose Nextstrain runs two data updates on the same date. Do we get the last of them? I think this has to do with the comparison version_date > selected_version["LastModified"], and we will get the last one if the last modified field is a datetime.

Additionally, I noticed that the automated unit test runs failed, I didn't dig into why.

Comment on lines +52 to +53
For a versioned, public S3 bucket and object key, return the version ID
of the object as it existed at a specific date (UTC)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If multiple versions are stored on the same date, will this be the first or last of those?

raise e

if selected_version is None:
raise ValueError(f"No version of {object_key} found before {date}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError(f"No version of {object_key} found before {date}")
raise ValueError(f"No version of {object_key} found on or before {date}")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, i'm not sure about this

as_of_datetime = datetime.strptime(as_of, "%Y-%m-%d").replace(tzinfo=timezone.utc)

(s3_version, s3_url) = get_s3_object_url(bucket, key, as_of_datetime)
filename = data_path / f"{as_of_datetime.date().strftime("%Y-%m-%d")}-{Path(key).name}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just use the provided as_of string here?

Suggested change
filename = data_path / f"{as_of_datetime.date().strftime("%Y-%m-%d")}-{Path(key).name}"
filename = data_path / f"{as_of}-{Path(key).name}"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion--I'll hold off, given that we're planning to rejig things and start passing timestamps around!

@pytest.fixture
def s3_setup():
"""Setup mock S3 bucket with versioned objects."""
with mock_aws():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

honestly, this is pretty dazzling!! ✨

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks--I fixed up the dependencies so the tests, you know, actually run!

session = get_session()
filename = data_path / Path(url).name
if as_of is None:
as_of_datetime = datetime.now().replace(tzinfo=timezone.utc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to add 1 day (or 23 hours, 59 minutes, 59 seconds?) here to ensure that we get the latest available as of right now. I think this returns midnight (first second of the day) of today, which may be prior to a Nextstrain data run that happened at e.g. 2am UTC?

selected_version = None
for page in page_iterator:
for version in page.get("Versions", []):
version_date = version["LastModified"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to question above -- is version_date a date or a datetime?

if as_of is None:
as_of_datetime = datetime.now().replace(tzinfo=timezone.utc)
else:
as_of_datetime = datetime.strptime(as_of, "%Y-%m-%d").replace(tzinfo=timezone.utc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar -- do we want midnight of that day, or just before midnight of the next day?

@bsweger
Copy link
Collaborator Author

bsweger commented Sep 27, 2024

This is all great! I asked questions in scattered places that I'll sum up here:

1. Suppose it's 3am eastern time/7am UTC and we go to get the latest available data from nextclade [snip]

I was being lazy so you could use feature, but yes, we should operating with datetime instead of date. I'll work on that.

2. I think this is less critical for us, but... suppose Nextstrain runs two data updates on the same date. Do we get the last of them?  I think this has to do with the comparison `version_date > selected_version["LastModified"]`, and we will get the last one if the last modified field is a datetime.

Yes, we will pull the most recent modified, which I think is what we want (let me know if you disagree). I'll update the tests to make this case more explicit.

Additionally, I noticed that the automated unit test runs failed, I didn't dig into why.

Silly oversight on my part--fixed!

@elray1
Copy link
Collaborator

elray1 commented Sep 27, 2024

  1. Thanks! I like the ability to use the feature.
  2. Agreed, this is what we want.

Copy link
Collaborator

@elray1 elray1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved

@bsweger
Copy link
Collaborator Author

bsweger commented Sep 27, 2024

Thanks @elray1! As discussed, here's the follow-up to ensure we address the timestamp concerns: #24

@bsweger bsweger merged commit 0b87882 into main Sep 27, 2024
1 check passed
@bsweger bsweger deleted the bsweger/metadata-download-as-of branch September 27, 2024 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a function to retrieve Nextclade ncov-ingest pipeline metadata at a point in time
2 participants