
retrieve_data.py incorrectly reports success when failing to retrieve files from HPSS under certain conditions #1165

Open
mkavulich opened this issue Dec 10, 2024 · 1 comment
Labels: bug (Something isn't working)

@mkavulich (Collaborator)

Expected behavior

When retrieve_data.py fails to retrieve files from HPSS, it should report a failure (or move on to the next data store, if applicable).

Current behavior

Under certain conditions, retrieve_data.py can fail to retrieve data from HPSS, yet still report success.

This problem is a complicated, convoluted mess of interacting logic between different functions, and it appears that it is only through compensating errors that the script reports success at all!

Part 1: htar's baffling success conditions

I believe the inciting incident, and therefore the ultimate fault, of this problem is the frankly baffling decision of the htar utility to always report success when querying files from an existing .tar archive, even if the files you specified don't exist! For example:

$ htar -tvf /NCEPPROD/hpssprod/runhistory/rh2022/202207/20220722/dcom_20220722.tar ./airnow/HourlyAQObs_20220722*.dat ./airnow/Monitoring_Site_Locations_V2.dat
[connecting to hpsscore1.fairmont.rdhpcs.noaa.gov/1217]
HTAR: Listing complete for /NCEPPROD/hpssprod/runhistory/rh2022/202207/20220722/dcom_20220722.tar, 0 files 0 total objects
HTAR: HTAR SUCCESSFUL

🥴 what?? You were able to find zero files (the -t flag simply lists the files, doesn't attempt to extract them), yet you explicitly reported success? I could maybe forgive this behavior as some weird edge case for some weird legacy reason, but unforgivably this same behavior happens when explicitly attempting to extract these files:

$ htar -xvf /NCEPPROD/hpssprod/runhistory/rh2022/202207/20220722/dcom_20220722.tar ./airnow/HourlyAQObs_20220722*.dat ./airnow/Monitoring_Site_Locations_V2.dat
[connecting to hpsscore1.fairmont.rdhpcs.noaa.gov/1217]
HTAR: HTAR SUCCESSFUL
$ ls
$

Oh, "HTAR SUCCESSFUL" you say? Well what a wonderful pat on the back for you htar! I happen to disagree with your definition of success, given that you didn't do anything!
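One way to guard against this (a sketch, not part of retrieve_data.py): since htar's exit status and "HTAR SUCCESSFUL" banner are useless here, parse the file count out of the listing summary line instead. The function name and the regex are my own assumptions about htar's output format, based only on the transcript above.

```python
import re

def htar_listing_found_files(htar_output: str) -> bool:
    """Return True only if an `htar -tvf` listing reported >= 1 file.

    htar exits successfully even when zero files match, so parse the
    "Listing complete for <archive>, N files" summary line instead.
    """
    match = re.search(r"Listing complete for \S+, (\d+) files", htar_output)
    return match is not None and int(match.group(1)) > 0

# The exact output shown above:
listing = (
    "[connecting to hpsscore1.fairmont.rdhpcs.noaa.gov/1217]\n"
    "HTAR: Listing complete for /NCEPPROD/hpssprod/runhistory/rh2022/"
    "202207/20220722/dcom_20220722.tar, 0 files 0 total objects\n"
    "HTAR: HTAR SUCCESSFUL\n"
)
print(htar_listing_found_files(listing))  # False, despite "HTAR SUCCESSFUL"
```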

Part 2: glob.glob


So now we know that sometimes we can do a whole bunch of "successful" operations but have nothing to show for it. This becomes a problem later in the script's logic, when we come to a function called clean_up_output_dir. This function is two-pronged: primarily, it takes the output from htar, which may be nested in various levels of subdirectories, and moves it to the output_path specified by the user on the command line, deleting the subdirectories as they are emptied. Its secondary purpose is to check that we actually extracted all the files we wanted. So for each input source_path to this function (the path of the expected file within the tarball, and therefore its expected relative location after extraction), it checks whether the expected data exists at that path, and moves it to its final destination.

So, great, this logic should catch the problems that htar tricked us into missing before; what's the issue? Well, unfortunately the logic makes a fatal assumption: it inherently assumes that the data exists before ever actually checking! This comes to bear in an innocent-looking loop that attempts to expand any wildcards in the input source paths:

    for p in source_paths:
        expand_source_paths.extend(glob.glob(p.lstrip("/")))

The function then loops over the list expand_source_paths and checks that all those files were properly downloaded.

Now, those of you who are still reading and haven't skipped to the bottom may have noticed something. Well, shoot: if the glob.glob call expands to an empty list (which is the case when the file does not exist), then nothing gets added to expand_source_paths, the "new" list used for checking the data, and so the subsequent logic never catches the missing data.

Still, it's no problem. If we fix this programming oversight so that empty globs are properly noted as "unavailable", surely our problem will be fixed.......
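The pitfall can be reproduced in isolation. A sketch using the same lstrip/glob pattern as the script, followed by one possible shape of the fix that records empty globs as unavailable:

```python
import glob
import os
import tempfile

# Run in an empty scratch directory so no pattern can possibly match.
os.chdir(tempfile.mkdtemp())

source_paths = ["/airnow/HourlyAQObs_20220722*.dat"]

# The original loop: a pattern that matches nothing expands to [],
# so the missing file silently drops out of the list checked later.
expand_source_paths = []
for p in source_paths:
    expand_source_paths.extend(glob.glob(p.lstrip("/")))
print(expand_source_paths)  # [] -- nothing left for the later check to fail on

# One possible fix: note patterns that matched nothing as unavailable.
expand_source_paths, unavailable = [], []
for p in source_paths:
    matches = glob.glob(p.lstrip("/"))
    if matches:
        expand_source_paths.extend(matches)
    else:
        unavailable.append(p)
print(unavailable)  # ['/airnow/HourlyAQObs_20220722*.dat']
```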

Part 3: Narrator: "The problem was not fixed"

If we zoom out and look at clean_up_output_dir() in context, it turns out to be the last step of a loop that checks each of the provided archives for the requested files. And the loop doesn't just discard that "unavailable" return value: it stores it in a dictionary whose keys are the values of "existing_archive", which is the path of the archive tar file on HPSS.

The logic here does attempt to account for the fact that we might need to get different files from different archives: for each tar file found, it notes the files it couldn't find, storing them in a set (for those who were unfamiliar, like me: a Python set is a data structure much like a dictionary, but with only keys and no values) as a value in a dictionary keyed by the name of the tar file. Then, once it has looped through the archives, it has a dictionary of sets of unavailable files; it calculates the union of those sets and compares it to the set of expected files. If the union equals the original expected set, then each expected file was only "missing" once, and no unexpected missing files were found, so we found all the files...I guess?

...but that logic seems to work only some of the time. For example, if you have only one HPSS archive file to look in, and all the files you search for are unavailable, then there will be no difference between the "expected" and "unavailable" sets, and the script will treat this as a success!
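That failure mode is easy to demonstrate with plain sets (the variable and file names here are illustrative, not the script's):

```python
# Expected files for one cycle (names are illustrative).
expected = {"fileA.gb2", "fileB.gb2"}

# Only one archive was searched, and NEITHER file was found in it --
# yet the union-of-unavailable check cannot tell this apart from success.
unavailable_per_archive = {
    "archive1.tar": {"fileA.gb2", "fileB.gb2"},
}

union_unavailable = set().union(*unavailable_per_archive.values())
if union_unavailable == expected:
    print("success!")  # reached even though zero files were retrieved
```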

The solution to this problem may be pretty straightforward: instead of tracking the files we didn't find, just track the files we did find. That way, if any files are left un-found at the end of the loop, it is straightforward to understand what that means. To be honest, I have yet to fully wrap my head around the original logic here, so it's possible I'm missing some functionality of the original, or some complication in my proposed solution.
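A sketch of that proposed bookkeeping, again with illustrative names:

```python
# Same scenario, but tracking what each archive DID provide.
expected = {"fileA.gb2", "fileB.gb2"}
found_per_archive = {
    "archive1.tar": set(),          # nothing retrieved here
    "archive2.tar": {"fileA.gb2"},  # a partial hit
}

# Anything expected but never found is unambiguously a failure.
found = set().union(*found_per_archive.values())
missing = expected - found
if missing:
    print(f"FATAL: could not retrieve: {sorted(missing)}")
```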

Machines affected

Any machine with HPSS access. This problem does not seem to affect observation pulling from other sources, such as AWS.

Steps To Reproduce

I have encountered this on occasion for various combinations of obs and dates, but I cannot find a "working" example of this bug in the current code from a quick search. It has been affecting me on a development branch for smoke verification, which is what prompted me to finally open an issue. But the problem is easily replicated by adding an entry to data_locations.yml where the HPSS archive name exists but the file within the archive does not:


  1. Edit parm/data_locations.yml to change the "file_names:" line under an HPSS archive to a filename that does not exist in the archive. In this example, I have modified the CCPA_obs: entry as follows:

         file_names:
           obs:
             - "badfile.t*z.01h.hrap.conus.gb2"

  2. Run retrieve_data.py to retrieve files from that modified entry. In my example, I used this command to put the output in a directory named output/:

         ./retrieve_data.py --file_set=obs --config=../parm/data_locations.yml --cycle_date=2022072100 --data_stores='hpss' --data_type=CCPA_obs --fcst_hrs 0 --output_path=output --debug --summary_file=retrieve_data.log

  3. Observe that output/ contains no files, despite the script reporting success.

Additional Information

I thought I had a fix for this a while back, but it just led me down further rabbit holes of failures and so I decided to stop and make this an issue without a solution. I hope that detailing this issue as I understand it will make it easier for someone else to fix it.

Output

Nothing relevant.

@mkavulich mkavulich added the bug Something isn't working label Dec 10, 2024
@MichaelLueken (Collaborator)

Given the level of detail that has been put into this issue, I'll link in previously opened issues #596 and #854, which were opened to address issues with get_extrn_lbcs/ics (respectively) reporting success even though no data was pulled from HPSS.

These two previous issues will be closed.
