Granule Data Field Returns "https" when in region s3 data are available #883

Open
1 task done
meteodave opened this issue Nov 26, 2024 · 6 comments
Labels
type: bug Something isn't working

Comments

@meteodave

meteodave commented Nov 26, 2024

Is this issue already tracked somewhere, or is this a new report?

  • I've reviewed existing issues and couldn't find a duplicate for this problem.

Current Behavior

I'm having trouble accessing LAADS data with earthaccess 0.12.0. The "Data" field of the granule returns only the HTTPS link, even though Earthdata Search shows these data as being in the cloud.

Expected Behavior

I would expect "Data" in the granule to return the S3 path link.

Steps To Reproduce

In Jupyter Notebook:

import boto3
import earthaccess

auth = earthaccess.login(persist=True)
granules = earthaccess.search_data(
    concept_id="C2859273114-LAADS",
    temporal=("2019-09-26", "2019-09-27"),
)

# Check whether the default boto3 configuration points at us-west-2.
if boto3.client("s3").meta.region_name == "us-west-2":
    print("found US-West-2")
else:
    print("US-West-2 not found")

# Print the summary of the first granule, including its "Data" field.
print(granules[0])

Output:

found US-West-2
Collection: {'ShortName': 'XAERDT_L2_ABI_G16', 'Version': '1'}
Spatial coverage: {'HorizontalSpatialDomain': {'Geometry': {'GPolygons': [{'Boundary': {'Points': [{'Longitude': -147.0, 'Latitude': -72.0}, {'Longitude': -3.0, 'Latitude': -72.0}, {'Longitude': -3.0, 'Latitude': 72.0}, {'Longitude': -147.0, 'Latitude': 72.0}, {'Longitude': -147.0, 'Latitude': -72.0}]}}]}}}
Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2019-09-25T23:50:00.000Z', 'EndingDateTime': '2019-09-26T00:00:00.000Z'}}
Size(MB): 49.1438646316528
Data: ['https://data.laadsdaac.earthdatacloud.nasa.gov/prod-lads/XAERDT_L2_ABI_G16/XAERDT_L2_ABI_G16.A2019268.2350.001.2023253054738.nc']

Environment

- OS:Debian GNU/Linux 11
- Python: 3.12.7
- earthaccess: 0.12.0

Additional Context

No response

@mfisher87 mfisher87 added the type: bug Something isn't working label Nov 26, 2024
@asteiker
Member

asteiker commented Dec 5, 2024

@meteodave Thanks for reporting this. I don't think this is unique to your LAADS example, as I see the same results when searching for an ICESat-2 collection in the cloud. I believe that the search_data results are only grabbing the first data access URL, which would be the HTTPS link in this case. @betolink does that sound right to you? Regardless, the s3 URL should still be found and utilized when using earthaccess.open().

So, I don't know if this is truly a bug versus an enhancement that we need to make to search_data() to provide all data access URLs that exist for the granule results, including s3.

@betolink
Member

betolink commented Dec 14, 2024

Thanks for stopping by the poster, @meteodave. It was great meeting you in person! Not a full answer, but some clarifications: the data_links() method defaults to "out-of-region" for the representation, which means we'll always see the output you're seeing, and that is perhaps a bug! Internally, however, .download(granules) and .open(granules) check whether we are in-region, which is also tricky because some instances and frameworks hide the metadata required to know whether we are in us-west-2.

We are having conversations around what the default should be. The best option so far is to assume that we are in the cloud and try the s3:// links if they are reachable. As for the representation, we may need to change the default to follow the same logic, or even show both, like:

Collection: {'ShortName': 'XAERDT_L2_ABI_G16', 'Version': '1'}
Spatial coverage: {'HorizontalSpatialDomain': {'Geometry': {'GPolygons': [{'Boundary': {'Points': [{'Longitude': -147.0, 'Latitude': -72.0}, {'Longitude': -3.0, 'Latitude': -72.0}, {'Longitude': -3.0, 'Latitude': 72.0}, {'Longitude': -147.0, 'Latitude': 72.0}, {'Longitude': -147.0, 'Latitude': -72.0}]}}]}}}
Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2019-09-25T23:50:00.000Z', 'EndingDateTime': '2019-09-26T00:00:00.000Z'}}
Size(MB): 49.1438646316528
Data: {
      S3: ['s3://nasa-s3-url/granule.nc'],
      HTTP: ['https://nasa-http-url/granule.nc']
}
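
Until the representation changes, a user-side helper can produce the same grouping. This is only a minimal sketch, not how earthaccess implements anything: group_links, preferred_links, and StubGranule are hypothetical names, the stub stands in for a real search result so the example runs offline, and passing access="external" to get the HTTPS links is an assumption based on the data_links() docstring. The reachability check is left to the caller because detecting in-region access is the tricky part described above.

```python
from typing import Callable, Dict, List


class StubGranule:
    """Hypothetical stand-in for a search result so the sketch runs offline."""

    def __init__(self, s3: List[str], https: List[str]) -> None:
        self._s3 = s3
        self._https = https

    def data_links(self, access: str = "external") -> List[str]:
        # Mirrors the behaviour discussed in this thread: "direct" gives
        # s3:// links, anything else gives HTTPS links.
        return list(self._s3) if access == "direct" else list(self._https)


def group_links(granule) -> Dict[str, List[str]]:
    """Return both kinds of link for one granule, keyed like the mock-up."""
    return {
        "S3": granule.data_links(access="direct"),
        # Assumption: "external" selects the HTTPS links.
        "HTTP": granule.data_links(access="external"),
    }


def preferred_links(granule, s3_reachable: Callable[[str], bool]) -> List[str]:
    """Use the s3:// links when every one is reachable, else fall back to HTTPS."""
    links = group_links(granule)
    s3_links = links["S3"]
    if s3_links and all(s3_reachable(url) for url in s3_links):
        return s3_links
    return links["HTTP"]


granule = StubGranule(
    s3=["s3://nasa-s3-url/granule.nc"],
    https=["https://nasa-http-url/granule.nc"],
)
print(group_links(granule))
# Pretend we are out of region: no s3:// link is reachable.
print(preferred_links(granule, s3_reachable=lambda url: False))
```

With a real granule from earthaccess.search_data(), the same two functions should work unchanged, since they only rely on data_links().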

@meteodave
Author

@betolink

Note that earthaccess 0.13.0 (earthaccess 0.13.0 pyhd8ed1ab_0 conda-forge) does not return an S3 data link in the granules output. I am running this command in AWS-US-West-2.

granules = earthaccess.search_data(concept_id = 'C2859273114-LAADS', temporal = ('2019-09-26','2019-09-27'), cloud_hosted=True)

[Collection: {'ShortName': 'XAERDT_L2_ABI_G16', 'Version': '1'}
Spatial coverage: {'HorizontalSpatialDomain': {'Geometry': {'GPolygons': [{'Boundary': {'Points': [{'Longitude': -147.0, 'Latitude': -72.0}, {'Longitude': -3.0, 'Latitude': -72.0}, {'Longitude': -3.0, 'Latitude': 72.0}, {'Longitude': -147.0, 'Latitude': 72.0}, {'Longitude': -147.0, 'Latitude': -72.0}]}}]}}}
Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2019-09-25T23:50:00.000Z', 'EndingDateTime': '2019-09-26T00:00:00.000Z'}}
Size(MB): 49.1438646316528
Data: ['https://data.laadsdaac.earthdatacloud.nasa.gov/prod-lads/XAERDT_L2_ABI_G16/XAERDT_L2_ABI_G16.A2019268.2350.001.2023253054738.nc'], Collection: {'ShortName': 'XAERDT_L2_ABI_G16', 'Version': '1'}

@chuckwondo
Collaborator

@betolink

Note that earthaccess 0.13.0 (earthaccess 0.13.0 pyhd8ed1ab_0 conda-forge) does not return an S3 data link in the granules output. I am running this command in AWS-US-West-2.

granules = earthaccess.search_data(concept_id = 'C2859273114-LAADS', temporal = ('2019-09-26','2019-09-27'), cloud_hosted=True)

[Collection: {'ShortName': 'XAERDT_L2_ABI_G16', 'Version': '1'}
Spatial coverage: {'HorizontalSpatialDomain': {'Geometry': {'GPolygons': [{'Boundary': {'Points': [{'Longitude': -147.0, 'Latitude': -72.0}, {'Longitude': -3.0, 'Latitude': -72.0}, {'Longitude': -3.0, 'Latitude': 72.0}, {'Longitude': -147.0, 'Latitude': 72.0}, {'Longitude': -147.0, 'Latitude': -72.0}]}}]}}}
Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2019-09-25T23:50:00.000Z', 'EndingDateTime': '2019-09-26T00:00:00.000Z'}}
Size(MB): 49.1438646316528
Data: ['https://data.laadsdaac.earthdatacloud.nasa.gov/prod-lads/XAERDT_L2_ABI_G16/XAERDT_L2_ABI_G16.A2019268.2350.001.2023253054738.nc'], Collection: {'ShortName': 'XAERDT_L2_ABI_G16', 'Version': '1'}

If the granule contains s3 links, they will not be shown in that output. Only HTTP links are shown.

What you'll want to do is pick a granule and get the "direct" data links, something like so:

granules = earthaccess.search_data(concept_id = 'C2859273114-LAADS', temporal = ('2019-09-26','2019-09-27'), cloud_hosted=True)
granules[0].data_links("direct")

That will show you s3 links, if there are any.

@betolink
Member

What you're seeing is a "summary", but the links are there. In fact, all the metadata is in each of the results.

e.g.

granules[0].data_links(access="direct") 

We could iterate on the results to collect all these s3 links:

links = [g.data_links(access="direct")[0] for g in granules] # we are assuming the granule has only one file

Then we could open or download these files. For this, earthaccess needs to know the DAAC provider. When we pass the results directly, it tries to infer this information, but then we get into the situation of having to detect correctly whether we are in AWS.

files = earthaccess.download(links, "./", provider="LAADS")
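
The list comprehension above takes element [0], so it assumes one file per granule and would raise IndexError for a granule with no direct links. A slightly more defensive sketch flattens whatever each granule returns. The names collect_direct_links and StubGranule and the example-bucket URLs are hypothetical; the stub only exists so the example runs without a search.

```python
from typing import Iterable, List


class StubGranule:
    """Hypothetical stand-in for a search result so the sketch runs offline."""

    def __init__(self, s3: List[str]) -> None:
        self._s3 = s3

    def data_links(self, access: str = "external") -> List[str]:
        return list(self._s3) if access == "direct" else []


def collect_direct_links(granules: Iterable) -> List[str]:
    """Gather every s3:// link from every granule, in order."""
    links: List[str] = []
    for granule in granules:
        # extend() copes with granules that have zero, one, or many files.
        links.extend(granule.data_links(access="direct"))
    return links


results = [
    StubGranule(["s3://example-bucket/a.nc", "s3://example-bucket/a.nc.xml"]),
    StubGranule([]),  # a granule without direct links is simply skipped
    StubGranule(["s3://example-bucket/b.nc"]),
]
links = collect_direct_links(results)
print(links)
```

The resulting flat list can then be passed to earthaccess.download() in the same way as above.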

@chuckwondo
Collaborator

Also, there's no need to be in-region when simply querying. Here's what I get running directly on my computer:

>>> import earthaccess
>>> granules = earthaccess.search_data(concept_id = 'C2859273114-LAADS', temporal = ('2019-09-26','2019-09-27'), cloud_hosted=True)
>>> granules[0].data_links("direct")
['s3://prod-lads/XAERDT_L2_ABI_G16/XAERDT_L2_ABI_G16.A2019268.2350.001.2023253054738.nc']
>>> 
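
If you want to hand that link to a different S3 client, it can be split into a bucket and a key with the standard library. A minimal sketch; split_s3_url is a hypothetical helper, not part of earthaccess:

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(url: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URL into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    # The bucket is the network location; the key is the path without
    # its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_url(
    "s3://prod-lads/XAERDT_L2_ABI_G16/"
    "XAERDT_L2_ABI_G16.A2019268.2350.001.2023253054738.nc"
)
print(bucket)  # prod-lads
print(key)
```

Reading the object itself still needs temporary credentials for the DAAC and, for direct access, a machine running in us-west-2.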
