[Bug]: RTC SAS Failing on S1A_IW_SLC__1SDV_20230114T161231_20230114T161258_046782_059BD6_D15F #734

Open
niarenaw opened this issue Feb 1, 2024 · 4 comments
Labels: bug (Something isn't working), must have, needs triage (Issue that requires triage), pcm.r02 (PCM Release 2)

Comments


niarenaw commented Feb 1, 2024

Checked for duplicates

Yes - I've already checked

Describe the bug

The RTC is failing on the following granule: S1A_IW_SLC__1SDV_20230114T161231_20230114T161258_046782_059BD6_D15F

It looks like something in the SAS is failing:
Running preprocessor for RtcS1PreProcessorMixin
Starting SAS execution for RtcS1Executor
Traceback (most recent call last):
  File "/home/rtc_user/opera/scripts/pge_main.py", line 189, in <module>
    pge_main()
  File "/home/rtc_user/opera/scripts/pge_main.py", line 185, in pge_main
    pge_start(run_config_filename)
  File "/home/rtc_user/opera/scripts/pge_main.py", line 159, in pge_start
    pge.run()
  File "/home/rtc_user/opera/pge/base/base_pge.py", line 760, in run
    self.run_sas_executable(**kwargs)
  File "/home/rtc_user/opera/pge/base/base_pge.py", line 739, in run_sas_executable
    elapsed_time = time_and_execute(
  File "/home/rtc_user/opera/util/run_utils.py", line 214, in time_and_execute
    logger.critical(module_name, ErrorCode.SAS_PROGRAM_FAILED, error_msg)
  File "/home/rtc_user/opera/util/logger.py", line 407, in critical
    raise RuntimeError(description)
RuntimeError: Command "/home/rtc_user/miniconda3/condabin/conda run --no-capture-output -n RTC rtc_s1.py /home/rtc_user/output_dir/scratch_dir/RunConfig_sas.yaml" failed with exit code 1
ERROR conda.cli.main_run:execute(47): conda run sh -c exec ${CONDA_ROOT}/bin/pge_docker_entrypoint.sh "${@}" -- --file /home/rtc_user/runconfig/RunConfig.yaml failed. (See above for error)

What did you expect?

Would expect the PGE to be able to process this granule

Reproducible steps

No response

Environment

Ran on PST venue using rc 2.1.1

Triaged job in GRQ: https://100.104.62.10/hysds_ui/tosca?dataset=%22triaged_job%22&_id=%22triaged_job-SCIFLO_L2_RTC_S1__2.1.1-S1A_IW_SLC__1SDV_20230114T161231_20230114T161258_046782_059BD6_D15F-r1-20240125T184212.5149Z_task-12054786-83a3-40f7-b4da-bcf1e0d294f3%22
@niarenaw niarenaw added bug Something isn't working needs triage Issue that requires triage labels Feb 1, 2024
@collinss-jpl collinss-jpl transferred this issue from nasa/opera-sds-pge Feb 2, 2024
@collinss-jpl
Collaborator

Retested the granule, and the RTC-S1 job still fails with the following error from the PGE/SAS log:

[StdErr] Traceback (most recent call last):
[StdErr]   File "/home/rtc_user/miniconda3/envs/RTC/bin/rtc_s1_single_job.py", line 35, in <module>
[StdErr]     run_single_job(cfg)
[StdErr]   File "/home/rtc_user/miniconda3/envs/RTC/lib/python3.9/site-packages/rtc/rtc_s1_single_job.py", line 1656, in run_single_job
[StdErr]     rg_lut, az_lut = compute_correction_lut(
[StdErr]   File "/home/rtc_user/miniconda3/envs/RTC/lib/python3.9/site-packages/rtc/rtc_s1_single_job.py", line 243, in compute_correction_lut
[StdErr]     gdal.Open(f'{scratch_path}/height.rdr', gdal.GA_ReadOnly).ReadAsArray()
[StdErr]   File "/home/rtc_user/miniconda3/envs/RTC/lib/python3.9/site-packages/osgeo/gdal.py", line 5296, in Open
[StdErr]     return _gdal.Open(*args)
[StdErr] RuntimeError: DSI or ACC record missing.  DTED access to
[StdErr] /home/rtc_user/scratch_dir/temp_1712164165.8587625/temp_1712164170.0907102/t160_342212_iw1//height.rdr failed.

This may indicate an issue with the underlying SLC granule.
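
When triaging logs like the one above, it can help to pull out which burst the failure refers to. The snippet below is a minimal sketch, not part of the PGE or SAS: the helper name `failing_bursts` is hypothetical, and the burst ID pattern is only inferred from the paths that appear in this thread (for example `t160_342212_iw1`), so treat it as an assumption.

```python
import re

# Burst ID pattern inferred from paths in the log (e.g. t160_342212_iw1).
# This is an assumption for illustration, not a documented format.
BURST_ID_RE = re.compile(r"t\d{3}_\d{6}_iw[1-3]")


def failing_bursts(log_text):
    """Return burst IDs found on [StdErr] lines, in order of first appearance."""
    seen = []
    for line in log_text.splitlines():
        if not line.startswith("[StdErr]"):
            continue  # ignore cleanup messages and other non-stderr output
        for burst_id in BURST_ID_RE.findall(line):
            if burst_id not in seen:
                seen.append(burst_id)
    return seen


# The last two stderr lines from the log above.
sample = (
    "[StdErr] RuntimeError: DSI or ACC record missing.  DTED access to\n"
    "[StdErr] /home/rtc_user/scratch_dir/temp_1712164165.8587625/"
    "temp_1712164170.0907102/t160_342212_iw1//height.rdr failed.\n"
)
bursts = failing_bursts(sample)
print(bursts)
```

For this log the failing file belongs to burst `t160_342212_iw1`.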

@hhlee445
Contributor

hhlee445 commented Jun 11, 2024

Retested the granule on the latest develop branch (6-11-24), and the RTC-S1 job still fails with the following error:


Running preprocessor for RtcS1PreProcessorMixin
Starting SAS execution for RtcS1Executor
Traceback (most recent call last):
  File "/home/rtc_user/opera/scripts/pge_main.py", line 190, in <module>
    pge_main()
  File "/home/rtc_user/opera/scripts/pge_main.py", line 186, in pge_main
    pge_start(run_config_filename)
  File "/home/rtc_user/opera/scripts/pge_main.py", line 160, in pge_start
    pge.run()
  File "/home/rtc_user/opera/pge/base/base_pge.py", line 760, in run
    self.run_sas_executable(**kwargs)
  File "/home/rtc_user/opera/pge/base/base_pge.py", line 739, in run_sas_executable
    elapsed_time = time_and_execute(
  File "/home/rtc_user/opera/util/run_utils.py", line 216, in time_and_execute
    logger.critical(module_name, ErrorCode.SAS_PROGRAM_FAILED, error_msg)
  File "/home/rtc_user/opera/util/logger.py", line 407, in critical
    raise RuntimeError(description)
RuntimeError: Command "['/home/rtc_user/miniconda3/condabin/conda', 'run', '--no-capture-output', '-n', 'RTC', 'rtc_s1.py', '/home/rtc_user/scratch_dir/RunConfig_sas.yaml']" failed with exit code 1, traceback:
Removing burst directory: /home/rtc_user/output_dir/t160_342209_iw3
Removing burst directory: /home/rtc_user/output_dir/t160_342210_iw3
Removing burst directory: /home/rtc_user/output_dir/t160_342211_iw3
Removing burst directory: /home/rtc_user/output_dir/t160_342212_iw3
Removing burst directory: /home/rtc_user/output_dir/t160_342213_iw3
Removing burst directory: /home/rtc_user/output_dir/t160_342214_iw3
Removing burst directory: /home/rtc_user/output_dir/t160_342215_iw3
Removing burst directory: /home/rtc_user/output_dir/t160_342216_iw3
Removing burst directory: /home/rtc_user/output_dir/t160_342217_iw3
Removing output directory: /home/rtc_user/output_dir
[StdErr] Traceback (most recent call last):
[StdErr]   File "/home/rtc_user/miniconda3/envs/RTC/bin/rtc_s1.py", line 41, in <module>
[StdErr]     main()
[StdErr]   File "/home/rtc_user/miniconda3/envs/RTC/bin/rtc_s1.py", line 36, in main
[StdErr]     run_parallel(cfg, path_logfile_parent, flag_full_log_formatting)
[StdErr]   File "/home/rtc_user/miniconda3/envs/RTC/lib/python3.9/site-packages/rtc/rtc_s1.py", line 627, in run_parallel
[StdErr]     os.rmdir(output_dir)
[StdErr] PermissionError: [Errno 13] Permission denied: '/home/rtc_user/output_dir'
ERROR conda.cli.main_run:execute(124): `conda run rtc_s1.py /home/rtc_user/scratch_dir/RunConfig_sas.yaml` failed. (See above for error)

ERROR conda.cli.main_run:execute(124): `conda run sh -c exec ${CONDA_ROOT}/bin/pge_docker_entrypoint.sh  "${@}" -- --file /home/rtc_user/runconfig/RunConfig.yaml` failed. (See above for error)

@collinss-jpl
Collaborator

A little more background on this issue: there are actually two different errors occurring here.

One is a bug within the RTC SAS code itself: if processing fails on any single burst, the code attempts to delete all completed bursts so that we don't accidentally send an incomplete set of products to the DAAC. However, the cleanup code also attempts to delete the parent output directory, which we mount from outside the container, hence the "Permission denied" error. This bug was actually fixed in the RTC SAS repository, but since it was relatively minor, a new release that includes the fix was never created.
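
As an illustration of the cleanup behavior described above, here is a minimal sketch of a cleanup routine that removes the per-burst directories but tolerates a failure to remove the parent directory. It is not the actual fix from the RTC SAS repository. The function name `cleanup_bursts` and its signature are hypothetical, and the demo uses a non-empty directory as a stand-in for the permission failure seen inside the container.

```python
import os
import shutil
import tempfile


def cleanup_bursts(output_dir, burst_ids):
    """Remove per-burst directories, then try to remove the parent directory.

    Failure to remove the parent (e.g. PermissionError on a mounted
    directory, or a non-empty directory) is tolerated rather than raised.
    """
    removed = []
    for burst_id in burst_ids:
        burst_dir = os.path.join(output_dir, burst_id)
        if os.path.isdir(burst_dir):
            shutil.rmtree(burst_dir)
            removed.append(burst_id)
    try:
        os.rmdir(output_dir)
        parent_removed = True
    except OSError:
        # PermissionError is a subclass of OSError, so it is caught here too.
        parent_removed = False
    return removed, parent_removed


# Demo on a throwaway directory. A leftover file keeps os.rmdir from
# succeeding, standing in for the permission failure on the mounted directory.
root = tempfile.mkdtemp()
output_dir = os.path.join(root, "output_dir")
for burst_id in ("t160_342209_iw3", "t160_342210_iw3"):
    os.makedirs(os.path.join(output_dir, burst_id))
open(os.path.join(output_dir, "leftover.log"), "w").close()

removed, parent_removed = cleanup_bursts(
    output_dir, ["t160_342209_iw3", "t160_342210_iw3"]
)
parent_still_exists = os.path.isdir(output_dir)
shutil.rmtree(root)  # tidy up the demo directory
print(removed, parent_removed, parent_still_exists)
```

With this approach the burst products are still removed, and a parent directory that cannot be deleted no longer turns the cleanup into a second traceback that hides the original error.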

The other bug seems to be specific to this input SLC granule, and related to code in the ISCE3 package that the RTC SAS is built on top of:

[StdErr] Traceback (most recent call last):
[StdErr]   File "/home/rtc_user/miniconda3/envs/RTC/bin/rtc_s1_single_job.py", line 35, in <module>
[StdErr]     run_single_job(cfg)
[StdErr]   File "/home/rtc_user/miniconda3/envs/RTC/lib/python3.9/site-packages/rtc/rtc_s1_single_job.py", line 1656, in run_single_job
[StdErr]     rg_lut, az_lut = compute_correction_lut(
[StdErr]   File "/home/rtc_user/miniconda3/envs/RTC/lib/python3.9/site-packages/rtc/rtc_s1_single_job.py", line 243, in compute_correction_lut
[StdErr]     gdal.Open(f'{scratch_path}/height.rdr', gdal.GA_ReadOnly).ReadAsArray()
[StdErr]   File "/home/rtc_user/miniconda3/envs/RTC/lib/python3.9/site-packages/osgeo/gdal.py", line 5296, in Open
[StdErr]     return _gdal.Open(*args)
[StdErr] RuntimeError: DSI or ACC record missing.  DTED access to
[StdErr] /home/rtc_user/scratch_dir/temp_1718146430.5104342/temp_1718146434.6384473/t160_342212_iw1//height.rdr failed.

When I informed @gshiroma about this error via email, this was his reply:

I was able to replicate the error you reported: “DSI or ACC record missing”. From a quick check, the fix doesn’t seem to be very straightforward. It seems that the ISCE3 Topo module is saving its outputs, but one of its layers, height.rdr, is being saved with this DSI/ACC record issue. The strange thing is that the layer has the same number of bytes as another layer (incidence_angle.rdr) that doesn’t have the same issue.

That was about as far as the conversation went.

Question for @gshiroma: can you point me to the appropriate ISCE3 repository so I can create a bug ticket so we don't lose track of this issue?

@gshiroma
Contributor

Thank you, @niarenaw, @collinss-jpl, and @hhlee445, for reporting this issue. Thank you also, @collinss-jpl, for creating a bug ticket so we can document and track this issue in the future.

The ISCE3 topo module is responsible for computing radar geometry parameters (e.g., interpolated DEM, incidence angle, local incidence angle) over a radar grid (e.g., the SLC grid). For each output layer, we create an ISCE3 raster object. In the OPERA RTC code, this is done here, where fname and dtype are assigned here. The ISCE3 module computes the geometry for each point. The height values are saved in the ISCE3 line here. For some reason, in some rare cases, the output height file height.rdr is reported as missing the DSI or ACC record. This may be a problem in the ISCE3 topo module, but it could also be a problem in the ISCE3 Raster module, since the latter is responsible for managing the interface between the data arrays and the file on disk.
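
One way to narrow this down is to compare the on-disk size and leading bytes of height.rdr with those of a layer that opens cleanly, such as incidence_angle.rdr. The snippet below is a minimal sketch using only the Python standard library. The helper `describe_layer` is hypothetical, and the `DTED_MARKERS` check rests on an assumption about the DTED layout (files beginning with a VOL, HDR, or UHL record). Whether that is what happens to height.rdr for this granule is not established in this thread. The demo files are synthetic stand-ins, not data from the failing job.

```python
import os
import tempfile

# Leading record markers of a DTED file. This is an assumption based on the
# DTED layout (optional VOL/HDR records followed by a UHL record).
DTED_MARKERS = (b"UHL", b"VOL", b"HDR")


def describe_layer(path, n_bytes=16):
    """Return the file size, the first bytes, and a DTED-like header flag."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(n_bytes)
    return {
        "path": path,
        "size": size,
        "head": head,
        "starts_like_dted": head[:3].upper() in DTED_MARKERS,
    }


# Demo with two synthetic 16-byte files of equal size.
with tempfile.TemporaryDirectory() as tmp:
    height_path = os.path.join(tmp, "height.rdr")
    inc_path = os.path.join(tmp, "incidence_angle.rdr")
    with open(height_path, "wb") as f:
        f.write(b"UHL" + bytes(13))  # synthetic header that resembles DTED
    with open(inc_path, "wb") as f:
        f.write(bytes(16))  # all-zero stand-in layer
    height_info = describe_layer(height_path)
    inc_info = describe_layer(inc_path)

print(height_info["size"], inc_info["size"])
print(height_info["starts_like_dted"], inc_info["starts_like_dted"])
```

If the two real files have the same size but only height.rdr begins with bytes that resemble a DTED record, that would point toward how the file is identified when it is opened rather than toward the values written by the topo module. This is a hypothesis to test, not a confirmed diagnosis.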

Thanks again, and please let me know if you have any comments or questions.

@hhlee445 hhlee445 added pcm.r02 PCM Release 2 must have labels Jul 3, 2024