latest and nearesttime fail with missing directory on AWS #77

Open
csteele2 opened this issue Dec 4, 2023 · 6 comments

@csteele2

csteele2 commented Dec 4, 2023

I know this is mostly an upstream issue, but I imagine it shouldn't take much added flexibility to handle. The nearesttime and latest functions seem to fail if an hour is missing from the directories on AWS.

For example, today, 4-Dec-2023 (day 338), the 18Z directory has not been created yet in noaa-goes16/ABI-L2-MCMIPF/2023/338, and the current time is 1946Z. This results in errors like
FileNotFoundError: noaa-goes16/ABI-L2-MCMIPF/2023/338/18
even when specifying a time like 1943Z on this day.

@csteele2
Author

csteele2 commented Dec 7, 2023

This actually happens even without dataflow issues. The easiest way to replicate it is to request full disk imagery just after the top of the hour, before any images have arrived for that hour (so that hour's directory does not exist on AWS yet). The call fails, citing the missing directory.
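One possible workaround for the top-of-the-hour case, sketched under assumptions: step back one hour at a time until a listing succeeds. The helper name `list_latest_available_hour`, the `max_lookback_hours` parameter, and the injected `ls` callable are hypothetical and not part of goes2go. In practice `ls` could be the `ls` method of an anonymous `s3fs.S3FileSystem`, which raises `FileNotFoundError` for a missing path, as in the error quoted above.

```python
from datetime import datetime, timedelta


def list_latest_available_hour(ls, bucket_product, when, max_lookback_hours=3):
    """Return (hour, files) for the newest non-empty hour directory at or before `when`.

    `ls` is any callable that lists a path and raises FileNotFoundError
    when the path does not exist (hypothetical injection point).
    """
    hour = when.replace(minute=0, second=0, microsecond=0)
    for _ in range(max_lookback_hours + 1):
        path = f"{bucket_product}/{hour:%Y/%j/%H}"
        try:
            files = ls(path)
        except FileNotFoundError:
            # The hour directory has not been created yet
            files = []
        if files:
            return hour, files
        hour -= timedelta(hours=1)
    raise FileNotFoundError(
        f"No files found within {max_lookback_hours} h before {when}"
    )
```

With the situation described in the first comment (request at 1946Z, hours 18 and 19 missing), this would return the listing for hour 17 instead of raising.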

@vwgeiser

vwgeiser commented Jan 25, 2024

Hey @csteele2 did you ever figure out a workaround for this? I am able to replicate the issue with the GOES Clear Sky Mask product.
Traceback says it occurs during the line df = G.df(start='2022-01-01 00:00', end='2022-1-31 23:56') when generating all the available files in a time range.

The 'noaa-goes16/ABI-L2-ACMC/2022/014/23' hour is empty and produces the error: FileNotFoundError: noaa-goes16/ABI-L2-ACMC/2022/014/23

while hour 22 'noaa-goes16/ABI-L2-ACMC/2022/014/22' didn't produce an error even though it is missing scans after minute 31

Did you find some way to check if a file exists first? My first thought is to catch this as an exception.

Additionally, if there is a way to return the number of missing scans in a time range (even sub-hourly, when no error is thrown), that would also be relevant to the cloud frequency problem I'm trying to solve.

@csteele2
Author

csteele2 commented Jan 25, 2024

Not yet. I work around the top-of-the-hour problem by generating a list only after I know the first file has hit AWS. I was deciding whether to hack on goes2go and try to fix it, or go my own completely different way. What's pulling me away from goes2go is the day/night blending available in satpy. I suppose I could try to come up with a fix within goes2go, but it probably won't be soon. Probably more like spring. Tangentially related is incomplete downloads: they are pretty easily worked around when they happen, but ideally something would be included here to re-download if a partial file is returned (which happens to me very frequently).
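For the partial-download problem, a minimal sketch of one approach: compare the local file size with the expected remote size and retry on a mismatch. The helper `download_with_size_check` and its parameters are hypothetical and not part of goes2go. With fsspec-style filesystems, the expected size could presumably come from `fs.info(path)["size"]` and `fetch` could wrap `fs.get(remote_path, local_path)`; both are assumptions about how it would be wired up.

```python
import os


def download_with_size_check(fetch, remote_size, local_path, max_attempts=3):
    """Call `fetch(local_path)` until the local file size equals `remote_size`.

    Returns the number of attempts used. `fetch` is a hypothetical callable
    that downloads one file to `local_path`.
    """
    for attempt in range(1, max_attempts + 1):
        fetch(local_path)
        if os.path.exists(local_path) and os.path.getsize(local_path) == remote_size:
            return attempt
        # Partial or missing file: remove it before the next attempt
        if os.path.exists(local_path):
            os.remove(local_path)
    raise IOError(f"{local_path} still incomplete after {max_attempts} attempts")
```

A size check only catches truncated files; it would not detect corruption that leaves the byte count unchanged.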

@vwgeiser

vwgeiser commented Jan 25, 2024

@csteele2

This is quick and dirty, but it could be a way to skip over hours that don't have any data in the AWS bucket. It seems to work for at least allowing bulk downloads to continue with the timerange function. Any suggestions for how to modify this to report how many scans within an hour are missing?

in data.py line 139

```python
# List all files for each date
# ----------------------------
files = []
for DATE in DATES:
    # Check whether the hour directory exists on AWS S3
    try:
        files += fs.ls(f"{satellite}/{product}/{DATE:%Y/%j/%H/}", refresh=refresh)
    except FileNotFoundError:
        # If the directory isn't found, alert the user and move on
        print(f"Files for hour {DATE} not present on AWS servers")
        continue
```

Input example case:

    from goes2go import GOES

    # ABI Clear Sky Mask, contiguous US
    G = GOES(satellite=16, product="ABI-L2-ACMC", domain='C')

    # Download data for a specified time range
    G.timerange(start='2022-01-14 00:00', end='2022-01-16 00:00')

Output for example case before change:
FileNotFoundError: noaa-goes16/ABI-L2-ACMC/2022/014/23

Output for example case after change:
Files for hour 2022-01-14 23:00:00 not present on AWS servers
Files for hour 2022-01-15 00:00:00 not present on AWS servers
Files for hour 2022-01-15 01:00:00 not present on AWS servers
📦 Finished downloading [517] files to [outputDirectory].

This doesn't fix or alert for partially filled hour directories: for example, 2022-01-15 02:00:00 only has scans for minutes 41, 46, 51, and 56.
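A rough way to count missing scans, including in partially filled hours, is to divide the time range into slots of one scan cadence and count the slots with no scan start. This is only a sketch: `count_missing_scans` is a hypothetical helper, and the five-minute default cadence is an assumption taken from the minute pattern above (41, 46, 51, 56), so it would need adjusting per product, domain, and scan mode. The scan start times are assumed to come from the file listing (for example a start-time column of the DataFrame returned by `G.df`, if one is available).

```python
from datetime import datetime, timedelta


def count_missing_scans(scan_starts, start, end, cadence=timedelta(minutes=5)):
    """Count cadence-wide slots in [start, end) that contain no scan start."""
    n_slots = (end - start) // cadence
    filled = set()
    for t in scan_starts:
        if start <= t < end:
            # Index of the slot this scan falls into
            filled.add((t - start) // cadence)
    return n_slots - len(filled)


# The partially filled hour mentioned above: scans at minutes 41, 46, 51, 56
hour = datetime(2022, 1, 15, 2)
scans = [hour + timedelta(minutes=m) for m in (41, 46, 51, 56)]
print(count_missing_scans(scans, hour, hour + timedelta(hours=1)))  # 8
```

Slot counting tolerates small jitter in scan start times, but it assumes a constant cadence over the whole range.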

I apologize for the formatting; I'm new to GitHub.

@wgustafson

wgustafson commented Mar 5, 2024

I just ran into this same issue when trying to plot a week-long series of GOES images that have a data gap in the middle. I was going to implement a fix along the lines of the code from @vwgeiser in the prior comment, and then discovered this thread. Cutting and pasting his code solves the general problem when using goes_timerange. While it does not anticipate the specific missing files to report them one-by-one, at least noting the empty directory and not crashing is helpful. This saves having to go through the AWS file listing and figure out the missing hours to avoid them manually.

I am in favor of including the above code change.

@andregracioso

andregracioso commented Aug 25, 2024

This happens because a file that covers the last ten minutes of an hour, but is created in the first seconds of the next hour, is placed in the folder for the hour of its scan interval, not the hour of its creation.

For example, the file:

OR_ABI-L2-FDCF-M6_G18_s20242382050213_e20242382059521_c20242382100018.nc

is in folder 20, but the code looks for this file in folder 21, which does not exist yet. That folder will only be created ten minutes later (with the next file in the sequence).
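To make the mismatch concrete, the timestamps in the file name can be parsed directly. This is a small sketch with a hypothetical helper, `parse_goes_times`; it assumes the usual GOES-R naming layout in which the `s`, `e`, and `c` fields are year, day of year, hour, minute, second, and a final digit for tenths of a second.

```python
import re
from datetime import datetime

# Scan start (s), scan end (e), and file creation (c) stamps, 14 digits each
PATTERN = re.compile(r"_s(\d{14})_e(\d{14})_c(\d{14})\.nc$")


def parse_goes_times(filename):
    """Return (start, end, created) datetimes parsed from a GOES file name."""
    match = PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Unrecognized GOES file name: {filename}")
    # Each stamp is YYYYJJJHHMMSS plus one digit of tenths of a second,
    # which is dropped here.
    return tuple(
        datetime.strptime(stamp[:13], "%Y%j%H%M%S") for stamp in match.groups()
    )


name = "OR_ABI-L2-FDCF-M6_G18_s20242382050213_e20242382059521_c20242382100018.nc"
start, end, created = parse_goes_times(name)
print(start.hour, created.hour)  # 20 21
```

The scan start falls in hour 20 while the creation time falls in hour 21, which matches the folder placement described above.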

The code from @vwgeiser fixes it. Tested with .latest() (/cc @blaylockbk )

data.py

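    files = []
    for DATE in DATES:
        try:
            # The hour directory may not exist on S3 yet
            files_in_date = fs.ls(f"{satellite}/{product}/{DATE:%Y/%j/%H/}", refresh=refresh)
        except FileNotFoundError:
            # Skip missing hours; catching only this error avoids hiding other failures
            continue
        if len(files_in_date) > 0:
            files += files_in_date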
