
WIP: Geocoded burst stack processor #18

Closed · wants to merge 12 commits

Conversation

@vbrancat (Contributor)

This PR adds a script for the geocoded burst stack processor.

The script works on a static stack of S1-A/B SAFE files and operates at the burst level. Given a reference acquisition date in the stack, the script identifies the common bursts among the static stack of SAFE files and generates a runconfig and a bash file used to run the geocoded burst workflow.

Stacks of bursts with the same burst ID are geocoded on the same grid, which makes their stitching easier. If the orbit directory is not provided, the script automatically downloads the orbit ephemerides files for each SAFE file of the stack.
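As an illustration of the common-burst selection described above, here is a minimal sketch (the function and variable names are hypothetical, not taken from the PR):

```python
from functools import reduce

def common_burst_ids(burst_ids_per_date):
    '''Intersect the per-date sets of burst IDs to find the bursts
    present in every acquisition of the stack.

    burst_ids_per_date: dict mapping acquisition date -> set of burst IDs
    '''
    return reduce(set.intersection, burst_ids_per_date.values())
```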

TO DO:

  1. Have the script be capable of reading options from a configuration file

@vbrancat requested review from hfattahi, LiangJYu and yunjunz on May 10, 2022 at 20:22
@hfattahi (Contributor)

@vbrancat I don't think you have pushed any code. Would you please check and push the code?

@hfattahi (Contributor)

@vbrancat now that the repo reorganization is merged, would you please rebase so that we can start evaluating this PR?

@vbrancat (Contributor, Author)

@hfattahi let me try to test the branch. I will push all the updates when ready :s

```python
with open(runfile_name, 'w') as rsh:
    path = os.path.dirname(os.path.realpath(__file__))
    rsh.write(
        f'python {path}/geo_cslc.py {runconfig_path}\n')
```
Contributor:

Suggested change:
```diff
-        f'python {path}/geo_cslc.py {runconfig_path}\n')
+        f'python {path}/s1_cslc.py {runconfig_path}\n')
```

```python
bursts = load_bursts(safe, orbit_path, subswath)
for burst in bursts:
    if burst.burst_id in list(set(burst_map['burst_id'])):
        runconfig_path = create_runconfig(burst, safe, orbit_path,
```
Contributor:

Currently runconfig_path is a relative path. It would be great to make it an absolute path.

@vbrancat (Contributor, Author):

No, the runconfig_path is already an absolute path. If you inspect one of the generated run files you can see, e.g.:

```
python /u/aurora0/vbrancat/COMPASS/src/compass/s1_cslc.py /mnt/aurora-r0/vbrancat/data/S1/stack_processor/Rosamond/geocoded//runconfigs/geo_runconfig_20220501_t64_iw3_b207.yaml
```

@hfattahi (Contributor)

Very exciting PR. Before digging into the code, I'm trying it out as a user; I provided two minor comments above. One nice piece of functionality would be a start and end date: assume there is a folder with all SLCs, but we want to process only certain dates; it would be nice to have that flexibility. Another relevant piece of functionality would be allowing the user to process a list of specific dates, or to exclude a list of specific dates. Hypothetical command-line options along these lines are sketched below.
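For instance, a minimal argparse sketch of the requested options (all option names here are hypothetical, not part of the PR):

```python
import argparse

parser = argparse.ArgumentParser(description='Select dates from an SLC stack')
# hypothetical options illustrating the requested flexibility
parser.add_argument('--start-date', help='first acquisition date to process (YYYYMMDD)')
parser.add_argument('--end-date', help='last acquisition date to process (YYYYMMDD)')
parser.add_argument('--include-dates', nargs='*', default=None,
                    help='process only these dates (YYYYMMDD)')
parser.add_argument('--exclude-dates', nargs='*', default=None,
                    help='skip these dates (YYYYMMDD)')
args = parser.parse_args()
```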

Comment on lines +494 to +496

```python
with open(runfile_name, 'w') as rsh:
    path = os.path.dirname(os.path.realpath(__file__))
    rsh.write(
```
Contributor:

This currently creates many run files, each of which has only one command to run. An alternative would be one run file with many commands, one per line. The latter allows running multiple commands in parallel, for example using this. I wonder whether it is preferred to have one run file with one command per line, or multiple run files with one command per file?
Maybe @seongsujeong has a preference?
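(For illustration, a single run file with one command per line could also be driven in parallel from Python; a minimal sketch, not the tool linked above:)

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_all(run_file, max_workers=10):
    # Run each non-empty line of the run file as a shell command,
    # at most max_workers at a time.
    with open(run_file) as f:
        commands = [line.strip() for line in f if line.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, shell=True), commands))
```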

@vbrancat (Contributor, Author):

@seongsujeong, before I make any modifications, what is your opinion on the comment above?

Contributor:

I've been running them in parallel like

```sh
ls run_files | xargs -P 10 sh -c 'bash run_files/$0 > logfile-$0.log'
```

so for now I've kept them as separate files
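(Note: as written, xargs may pass several filenames to a single sh invocation, of which only $0 is used; adding -n 1, i.e. xargs -n 1 -P 10, guarantees one script per invocation so that up to 10 really run in parallel.)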

@vbrancat (Contributor, Author)

@hfattahi the latest commit should address the following:

  1. Inspect the stack and identify the common bursts across the stack. If no burst_id is specified for processing, only the common bursts are used
  2. Specify a start_date and an end_date. If specified, all the burst IDs with dates in this interval will be processed
  3. Ability to exclude dates from the stack

I still need to do some code clean-up (move the orbit download to the reader) and improve the pandas dataframe filtering (it is a bit slow); a vectorized sketch of that filtering follows.
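(A vectorized selection is one common way to speed up this kind of filtering; a minimal sketch, assuming the 'date' column and the start_date/end_date/exclude_dates parameters used elsewhere in this PR:)

```python
import pandas as pd

def filter_dates(burst_map, start_date, end_date, exclude_dates=None):
    # Keep rows inside [start_date, end_date] and drop excluded dates
    # in a single boolean-mask pass.
    mask = (burst_map['date'] >= start_date) & (burst_map['date'] <= end_date)
    if exclude_dates is not None:
        mask &= ~burst_map['date'].isin(exclude_dates)
    return burst_map[mask]
```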

Comment on lines +128 to +201
```python
def get_orbit_dict(sensor_id, start_time, end_time, orbit_type):
    '''
    Query Copernicus GNSS API to find latest orbit file

    Parameters:
    ----------
    sensor_id: str
        Sentinel satellite identifier ('A' or 'B')
    start_time: datetime object
        Sentinel start acquisition time
    end_time: datetime object
        Sentinel end acquisition time
    orbit_type: str
        Type of orbit to download (AUX_POEORB: precise, AUX_RESORB: restituted)

    Returns:
    -------
    orbit_dict: dict
        Python dictionary with [orbit_name, orbit_type, download_url]
    '''
    # Check if correct orbit_type
    if orbit_type not in ['AUX_POEORB', 'AUX_RESORB']:
        err_msg = f'{orbit_type} not a valid orbit type'
        raise ValueError(err_msg)

    # Add a 30 min margin to start_time and end_time
    pad_start_time = start_time - datetime.timedelta(hours=0.5)
    pad_end_time = end_time + datetime.timedelta(hours=0.5)
    new_start_time = pad_start_time.strftime('%Y-%m-%dT%H:%M:%S')
    new_end_time = pad_end_time.strftime('%Y-%m-%dT%H:%M:%S')
    query_string = f"startswith(Name,'S1{sensor_id}') and substringof('{orbit_type}',Name) " \
                   f"and ContentDate/Start lt datetime'{new_start_time}' and ContentDate/End gt datetime'{new_end_time}'"
    query_params = {'$top': 1, '$orderby': 'ContentDate/Start asc',
                    '$filter': query_string}
    query_response = requests.get(url=scihub_url, params=query_params,
                                  auth=(scihub_user, scihub_password))
    # Parse XML tree from query response
    xml_tree = ElementTree.fromstring(query_response.content)
    # Extract w3.org URL
    w3_url = xml_tree.tag.split('feed')[0]

    # Extract orbit's name, id, url
    orbit_id = xml_tree.findtext(
        f'.//{w3_url}entry/{m_url}properties/{d_url}Id')
    orbit_url = f"{scihub_url}('{orbit_id}')/$value"
    orbit_name = xml_tree.findtext(f'./{w3_url}entry/{w3_url}title')

    if orbit_id is not None:
        orbit_dict = {'orbit_name': orbit_name, 'orbit_type': orbit_type,
                      'orbit_url': orbit_url}
    else:
        orbit_dict = None
    return orbit_dict


def download_orbit(output_folder, orbit_url):
    '''
    Download S1-A/B orbits

    Parameters:
    ----------
    output_folder: str
        Path to directory where to store orbits
    orbit_url: str
        Remote url of orbit file to download
    '''
    response = requests.get(url=orbit_url, auth=(scihub_user, scihub_password))
    # Get header and find filename
    header = response.headers['content-disposition']
    _, header_params = cgi.parse_header(header)
    # construct orbit filename
    orbit_filename = os.path.join(output_folder, header_params['filename'])
    # Save orbits
    open(orbit_filename, 'wb').write(response.content)
```
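For reference, an illustrative use of the two helpers above (scihub_url, scihub_user and scihub_password are module-level globals in this file; the sensor and times below are placeholders):

```python
import datetime

start = datetime.datetime(2022, 5, 1, 6, 0, 0)
end = datetime.datetime(2022, 5, 1, 6, 0, 25)

orbit_dict = get_orbit_dict('A', start, end, 'AUX_POEORB')
if orbit_dict is not None:
    download_orbit('orbits', orbit_dict['orbit_url'])
```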
Contributor:

Eventually replace both of these with this?

@vbrancat (Contributor, Author), Jul 11, 2022:

Agreed, this will be replaced with the orbit download functionality in the s1-reader. I will wait for that PR to be merged before applying any related modification to this PR

Comment on lines +513 to +514

```python
if burst_id is not None:
    burst_map = prune_dataframe(burst_map, 'burst_id', burst_id)
```
Contributor:

This prune_dataframe call can be avoided if generate_burst_map takes in burst_id and uses it to filter bursts. Like this. A rough sketch follows.
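(A rough sketch of the suggested refactor; the real generate_burst_map signature in the PR differs, and the burst objects are only assumed to expose .burst_id, as used elsewhere in this PR:)

```python
import pandas as pd

def generate_burst_map(bursts, burst_ids=None):
    # Filter while building the map so no prune_dataframe call is
    # needed afterwards.
    rows = [{'burst_id': b.burst_id}
            for b in bursts
            if burst_ids is None or b.burst_id in burst_ids]
    return pd.DataFrame(rows)
```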

Comment on lines +517 to +524

```python
if start_date is not None:
    burst_map = burst_map[burst_map['date'] >= start_date]
if end_date is not None:
    burst_map = burst_map[burst_map['date'] <= end_date]

# Exclude some dates if the user requires it
if exclude_dates is not None:
    burst_map = prune_dataframe(burst_map, 'date', exclude_dates)
```
Contributor:

Like burst_id, these bits of pruning could also be handled in generate_burst_map.

Contributor:

The start and end date pruning now occurs before the burst map generation, so that the number of zip files to load is minimized. A sketch of such pre-filtering follows.
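(For illustration, acquisition dates can be read off the SAFE filename before any zip is opened; this helper is hypothetical, not the PR's code, and assumes dates are 'YYYYMMDD' strings:)

```python
import os

def safe_date(safe_path):
    # SAFE names embed the start time, e.g. S1A_IW_SLC__1SDV_20220501T134905_...;
    # splitting on '_' puts that timestamp in field 5 (the double
    # underscore yields an empty field 3)
    return os.path.basename(safe_path).split('_')[5][:8]  # 'YYYYMMDD'

def zips_in_window(zip_list, start_date, end_date):
    # lexicographic comparison works for zero-padded 'YYYYMMDD' strings
    return [z for z in zip_list if start_date <= safe_date(z) <= end_date]
```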

```python
orbit_path = get_orbit_file_from_list(safe,
                                      glob.glob(f'{orbit_dir}/S1*'))
for subswath in i_subswath:
    bursts = load_bursts(safe, orbit_path, subswath)
```
Contributor:

If the Sentinel1BurstSlc objects are stored in burst_map, then they don't need to be reloaded here.

If my suggestions to filter in generate_burst_map are accepted, that should keep the burst_map object from getting too big? A sketch of the caching idea follows.
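(A minimal sketch of that caching idea, with a hypothetical 'burst' column holding the loaded objects; the actual implementation landed in #53:)

```python
import pandas as pd

def generate_burst_map_cached(safe_list, orbit_path, subswaths):
    # Store each loaded Sentinel1BurstSlc next to its metadata so every
    # zip is opened exactly once; later steps reuse row['burst'] instead
    # of calling load_bursts a second time.
    records = []
    for safe in safe_list:
        for subswath in subswaths:
            for burst in load_bursts(safe, orbit_path, subswath):
                records.append({'burst_id': burst.burst_id,
                                'safe': safe,
                                'burst': burst})
    return pd.DataFrame(records)
```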

Contributor:

This suggestion is implemented in #53, and it cuts the script run time in half (since 95% or more of the script time is spent loading, and every zip was loaded twice).

```python
                             is_split_spectrum,
                             low_band, high_band, pol,
                             x_spac, y_spac)
date_str = str(date)
```
Contributor:

I think this line should be moved to right after line 534. Right now if the "else" triggers, the logging entry on line 552 uses date_str, but it isn't defined.

@scottstanie (Contributor)

I believe I've addressed the comments here, so moving the final implementation to #53

@vbrancat deleted the geo_burst_stack branch on November 17, 2022 at 15:02