-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bsweger/metadata download as of #23
Conversation
The config object was originally designed to be used with the script that assigns sequences to clades. We still have some work to do to make the Config less specific to that original usage, but to start with, let's make the data directory optional and set it to a default value.
This changeset allows Nextstrain genome sequence metadata downloads to get the S3 file version with the most recent modified date that is less than the optional "as_of" date parameter. If no "as_of" date is supplied, download the most recent version of the file.
For clearer unit testing
@@ -68,7 +69,8 @@ def get_clades(clade_counts: pl.LazyFrame, threshold: float, threshold_weeks: in | |||
|
|||
# FIXME: provide ability to instantiate Config for the get_clade_list function and get the data_path from there | |||
def main( | |||
genome_metadata_path: AnyPath = Config.nextstrain_latest_genome_metadata, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than download the metadata file from the URL listed on Nextstrain's website, we'll download the file via an S3 https link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reference.py
is no longer a sensible name for this file, but I chose not to pull that thread as part of this update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is all great! I asked questions in scattered places that I'll sum up here:
- Suppose it's 3am eastern time/7am UTC and we go to get the latest available data from nextclade. Their latest data update job completed at ~1 or 2 am UTC. I provide no argument or the current date for
as_of
, and I think this gets translated to 12:00:00 am UTC, which is prior to today's data update job, and so I end up pulling yesterday's data. Instead, should we translate a provided as_of date to 23:59:59 to get the last data that was available that day? - I think this is less critical for us, but... suppose Nextstrain runs two data updates on the same date. Do we get the last of them? I think this has to do with the comparison
version_date > selected_version["LastModified"]
, and we will get the last one if the last modified field is a datetime.
Additionally, I noticed that the automated unit test runs failed, I didn't dig into why.
For a versioned, public S3 bucket and object key, return the version ID | ||
of the object as it existed at a specific date (UTC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If multiple versions are stored on the same date, will this be the first or last of those?
raise e | ||
|
||
if selected_version is None: | ||
raise ValueError(f"No version of {object_key} found before {date}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise ValueError(f"No version of {object_key} found before {date}") | |
raise ValueError(f"No version of {object_key} found on or before {date}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, i'm not sure about this
as_of_datetime = datetime.strptime(as_of, "%Y-%m-%d").replace(tzinfo=timezone.utc) | ||
|
||
(s3_version, s3_url) = get_s3_object_url(bucket, key, as_of_datetime) | ||
filename = data_path / f"{as_of_datetime.date().strftime("%Y-%m-%d")}-{Path(key).name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just use the provided as_of
string here?
filename = data_path / f"{as_of_datetime.date().strftime("%Y-%m-%d")}-{Path(key).name}" | |
filename = data_path / f"{as_of}-{Path(key).name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion--I'll hold off, given that we're planning to rejig things and start passing timestamps around!
@pytest.fixture | ||
def s3_setup(): | ||
"""Setup mock S3 bucket with versioned objects.""" | ||
with mock_aws(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
honestly, this is pretty dazzling!! ✨
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks--I fixed up the dependencies so the tests, you know, actually run!
session = get_session() | ||
filename = data_path / Path(url).name | ||
if as_of is None: | ||
as_of_datetime = datetime.now().replace(tzinfo=timezone.utc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we need to add 1 day (or 23 hours, 59 minutes, 59 seconds?) here to ensure that we get the latest available as of right now. I think this returns midnight (first second of the day) of today, which may be prior to a Nextstrain data run that happened at e.g. 2am UTC?
selected_version = None | ||
for page in page_iterator: | ||
for version in page.get("Versions", []): | ||
version_date = version["LastModified"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to question above -- is version_date
a date or a datetime?
if as_of is None: | ||
as_of_datetime = datetime.now().replace(tzinfo=timezone.utc) | ||
else: | ||
as_of_datetime = datetime.strptime(as_of, "%Y-%m-%d").replace(tzinfo=timezone.utc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar -- do we want midnight of that day, or just before midnight of the next day?
I was being lazy so you could use feature, but yes, we should operating with datetime instead of date. I'll work on that.
Yes, we will pull the most recent modified, which I think is what we want (let me know if you disagree). I'll update the tests to make this case more explicit.
Silly oversight on my part--fixed! |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved
Resolves #12
This PR modifies the
download_covid_genome_metadata
function inutils/sequence.py
to accept an as_of parameter.The "manual" usage is a little awkward because we're pulling the Nextstrain S3 information from a Config file that desperately needs a refactor (see individual commit messages).
Example usage: