# `retrieve_data.py` incorrectly reports success when failing to retrieve files from HPSS under certain conditions #1165
**Labels:** bug
### Expected behavior

When `retrieve_data.py` fails to retrieve files from HPSS, it should report a failure (or move on to the next data store, if applicable).

### Current behavior

Under certain conditions, `retrieve_data.py` can fail to retrieve data from HPSS, yet still report success. This problem is actually a quite complicated and convoluted mess of interacting logic between different functions, and it appears that it is only through compensating errors that the script reports success at all!
#### Part 1: `htar`'s baffling success conditions

I believe the inciting incident, and therefore the ultimate fault, of this problem is the frankly baffling decision by the `htar` utility to always report success when querying files from an existing `.tar` archive, even if the files you specified don't exist! For example:

🥴 what?? You were able to find zero files (the `-t` flag simply lists the files; it doesn't attempt to extract them), yet you explicitly reported success? I could maybe forgive this behavior as some weird edge case for some weird legacy reason, but unforgivably, the same behavior happens when explicitly attempting to extract these files:

Oh, "HTAR SUCCESSFUL", you say? Well, what a wonderful pat on the back for you, `htar`! I happen to disagree with your definition of success, given that you didn't do anything!
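For anyone who wants to poke at this themselves, here is a minimal sketch of the behavior from Python, assuming a machine with `htar` and HPSS access (the archive path and member name below are hypothetical placeholders):

```python
import subprocess

# Hypothetical archive and member name, purely for illustration.
archive = "/some/hpss/path/archive.tar"
member = "file_that_does_not_exist.grib2"

# Ask htar to extract a member that is not in the archive.
result = subprocess.run(
    ["htar", "-xvf", archive, member],
    capture_output=True,
    text=True,
)

# htar exits 0 (and prints "HTAR SUCCESSFUL") even though nothing was
# extracted, so the exit code alone cannot be trusted.
print(result.returncode)
```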
#### Part 2: `glob.glob`
So now we know that sometimes we can do a whole bunch of "successful" operations and have nothing to show for it. This becomes a problem later in the script's logic, when we come to a function called `clean_up_output_dir`. This function is two-pronged: primarily, it takes the output from `htar`, which may be nested in various levels of subdirectories, and moves it to the `output_path` specified by the user on the command line, deleting the subdirectories as they are emptied. Its secondary purpose is to check that we actually extracted all the files we wanted. So for each input `source_path` to this function (the path within the tarball of the expected file, and therefore its expected relative location after extraction), it checks whether the expected data exists at that path, and moves it to its final destination.

So, great, this logic should catch the problems that `htar` tricked us into not catching before; what's the issue? Well, unfortunately, the logic makes a fatal assumption: it inherently assumes that the data actually exists before ever actually checking! This comes to bear in an innocent-looking loop that attempts to expand any wildcards in the input source paths:
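The pattern is essentially this (a minimal sketch, not the verbatim `retrieve_data.py` code; the paths are hypothetical):

```python
import glob

# Hypothetical source paths: the expected locations of the files after
# extraction, possibly containing wildcards.
source_paths = ["ccpa/ccpa.t00z.01h.hrap.conus.gb2", "ccpa/*.gb2"]

expand_source_paths = []
for p in source_paths:
    # If nothing on disk matches p, glob.glob() returns [] and p silently
    # contributes nothing to the list that gets checked later.
    expand_source_paths.extend(glob.glob(p))
```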
The function then loops over the resulting list `expand_source_paths` and checks that all those files were properly downloaded.

Now, those of you who are still reading and haven't skipped to the bottom may have noticed something.
Well, shoot. If the `glob.glob` call expands to an empty list (which is the case when the file does not exist), then nothing gets added to the "new" list `expand_source_paths` used for checking the data, and so the subsequent logic never catches the missing files.

Still, it's no problem. If we fix this programming oversight so that empty globs are properly noted as "unavailable", surely our problem will be fixed.......
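The fix at this level would presumably look something like the following (again just a sketch, not an actual patch to the script):

```python
import glob

source_paths = ["ccpa/ccpa.t00z.01h.hrap.conus.gb2"]

expand_source_paths = []
unavailable = []
for p in source_paths:
    matches = glob.glob(p)
    if matches:
        expand_source_paths.extend(matches)
    else:
        # An empty glob is now recorded as unavailable instead of vanishing.
        unavailable.append(p)
```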
#### Part 3: Narrator: "The problem was not fixed"
If we zoom out and look at `clean_up_output_dir()` in context, it turns out to be the last step of a loop that checks each of the provided archives for the requested files. And the loop doesn't just discard that "unavailable" return value; it stores it in a dictionary whose keys are the values of `existing_archive`, the path of the archive tar file on HPSS.

The logic here does attempt to account for the fact that we might need to get different files from different archives: for each tar file found, it notes the files it couldn't find, storing them in a set (for those who were unfamiliar like me, a Python set is a data structure just like a dictionary, but with only keys and no values) as a value in a dictionary keyed by the name of the tar file. Then, once it has looped through the archives, it has a dictionary of "unavailable" file sets; it calculates the union of those sets and compares it to the set of expected files. If the union equals the original expected set, then each expected file was only "missing" once, and no unexpected missing files were found, so we found all the files...I guess?
...but that logic seems to work only some of the time. For example, if you have only one HPSS archive file to look in, and all the files you search for are unavailable, then there will be no differences between the "expected" and "unavailable" sets, and the script will treat this as a success, as the sketch below shows!
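Here is a hedged sketch of that single-archive failure mode as I understand the description above (the variable names are illustrative, not the script's actual variables):

```python
# The files we asked for.
expected = {"file_a.gb2", "file_b.gb2"}

# Only one archive was searched, and neither file was in it.
unavailable = {
    "/some/hpss/path/archive1.tar": {"file_a.gb2", "file_b.gb2"},
}

# The union of the per-archive "unavailable" sets...
all_unavailable = set.union(*unavailable.values())

# ...equals the expected set, which the logic described above takes to
# mean every file was accounted for, even though nothing was retrieved.
print(all_unavailable == expected)  # True -> treated as success
```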
The solution to this problem may be pretty straightforward: instead of tracking the files we didn't find, just track the files we did find. That way, if any files are left un-found at the end of the loop, it's straightforward to understand what that represents. To be fully honest, I have yet to fully wrap my head around the original logic here, so it's possible that I'm missing some functionality of the original logic, or that there is some complication in my proposed solution.
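For concreteness, the idea might look something like this (a sketch only; `retrieve_from_archive` is a hypothetical stand-in for the per-archive extraction step, not a real function in the script):

```python
# Hypothetical stand-in for the per-archive extraction step.
def retrieve_from_archive(archive: str, wanted: set[str]) -> set[str]:
    """Return the subset of `wanted` actually extracted from `archive`."""
    return set()  # stub: pretend nothing was found

archives = ["/some/hpss/path/archive1.tar"]
expected = {"file_a.gb2", "file_b.gb2"}
found: set[str] = set()

for archive in archives:
    found |= retrieve_from_archive(archive, expected - found)

missing = expected - found
if missing:
    # Any file still un-found after all archives is unambiguously a failure.
    raise FileNotFoundError(f"Could not retrieve: {sorted(missing)}")
```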
### Machines affected
Any machine with HPSS access. This problem does not seem to affect observation pulling from other sources, such as AWS.
### Steps To Reproduce
I have encountered this on occasion for various combinations of obs and dates, but I cannot find a "working" example of this bug in the current code from a quick search. This has been affecting me on a development branch for smoke verification, which is what prompted me to finally open an issue. But the problem is easily replicated by adding an entry to `data_locations.yml` where the HPSS archive name exists but the file within the archive does not:

1. Edit `parm/data_locations.yml` to change the `file_names:` line under an HPSS archive to a filename that does not exist in the archive. In this example, I modified the `CCPA_obs:` entry this way.
2. Run `retrieve_data.py` to retrieve files from that modified entry. In my example, I ran it with the output directed to a directory named `output/`.
3. Observe that `output/` contains no files, despite the script reporting success.

### Additional Information
I thought I had a fix for this a while back, but it just led me down further rabbit holes of failures, so I decided to stop and open this issue without a solution. I hope that detailing this issue as I understand it will make it easier for someone else to fix it.
### Output
Nothing relevant, so here's a cute cat for those of you who skipped straight to the bottom