DAAC direct Bucket Access #793

Open
wildintellect opened this issue Aug 21, 2023 · 12 comments
Labels: ADE (Algorithm Development Environment Subsystem), DPS (Data Processing Subsystem), Enhancement (New feature or request), MSFC (MSFC related issues)

Comments

@wildintellect
Collaborator

Is your feature request related to a problem? Please describe.
Not all DAACs have federated tokens working with S3 temporary credential endpoints or with managing temporary sessions. As part of the next generation of solutions from EOSDIS, we, along with VEDA, are piloting direct bucket access from within the same AWS region as Earthdata Cloud (us-west-2).

Several DAACs have granted us read access:

  • ORNL (GEDI 3, 4)
  • LPDAAC (HLS, GEDI 1, 2)
  • NSIDC (ATL0*)
  • GES DISC
  • PO DAAC (not working at this time)

Describe the solution you'd like
Users of the ADE and DPS need a way to make use of these credentials.
Possible options:

  • Swap the roles given to the DAACs, or those used by the DPS/ADE
  • Add a function to maap-py to assume-role and revert back as needed
  • ?
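To make the second option more concrete, here is a minimal sketch of "assume a role, then revert" as a context manager. This is not maap-py code: `temporary_aws_credentials` is a hypothetical name, and the sketch assumes `credentials` is the `Credentials` dictionary returned by an STS `assume_role` call.

```python
import os
from contextlib import contextmanager

# Environment variables read by the AWS CLI and SDKs, mapped to the keys
# of the "Credentials" dictionary in an STS assume_role response.
_ENV_TO_KEY = {
    "AWS_ACCESS_KEY_ID": "AccessKeyId",
    "AWS_SECRET_ACCESS_KEY": "SecretAccessKey",
    "AWS_SESSION_TOKEN": "SessionToken",
}


@contextmanager
def temporary_aws_credentials(credentials):
    """Hypothetical helper: export assumed-role credentials, then revert."""
    # Remember what was set before so it can be restored on exit.
    saved = {name: os.environ.get(name) for name in _ENV_TO_KEY}
    try:
        for name, key in _ENV_TO_KEY.items():
            os.environ[name] = credentials[key]
        yield
    finally:
        # Put back the previous values, or remove variables that were unset.
        for name, value in saved.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value
```

With something like this, reverting is just a matter of leaving the `with` block; the previous environment comes back even if the body raises.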

Describe alternatives you've considered
Users could manually code role switching themselves.

Additional context
For testing in MCP, the MAAP-ADE-K8S role was given trust to assume maap-data-reader. Switching back has not been tested yet, and probably requires trust in the other direction.

The following roles have permissions
MAAP Prod account, on MCP
arn:aws:iam::8_7:role/maap-data-reader
arn:aws:iam::8_7:role/tiler-lambda-role
arn:aws:iam::8_7:role/maap-data-manager

MAAP Dev account, on SMCE
arn:aws:iam::9_4:role/maap-data-reader-dev
arn:aws:iam::9_4:role/maap-data-manager-dev

To test this currently, you have to run:

export $(printf "AWS_ACCESS_KEY_ID=%s AWS_SECRET_ACCESS_KEY=%s AWS_SESSION_TOKEN=%s" \
$(aws sts assume-role \
--role-arn arn:aws:iam::884094767067:role/maap-data-reader \
--role-session-name TestSessionName \
--query "Credentials.[AccessKeyId,SecretAccessKey,SessionToken]" \
--output text))

Then you can try to access

aws s3 ls s3://nsidc-cumulus-prod-protected/ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_02.h5
@wildintellect wildintellect added Enhancement New feature or request MSFC MSFC related issues ADE Algorithm Development Environment Subsystem DPS Data Processing Subsystem labels Aug 21, 2023
@wildintellect
Collaborator Author

wildintellect commented Sep 8, 2023

I have a working example with Python code now. It should be simple enough to supply the ARNs via the maap-py library (possibly as SSM parameters)

https://gist.github.com/wildintellect/e561eccdddee851a571004cf1fbe83b8

Funny story, GEDI L4B seems to have more open permissions, or ORNL granted permission to the ADE role too. So I switched to testing with GES DISC data.

@chuckwondo
Collaborator

@wildintellect, here's another possible approach that works within the ADE:

aws configure --profile maap-data-reader set role_arn arn:aws:iam::884094767067:role/maap-data-reader
aws configure --profile maap-data-reader set credential_source Ec2InstanceMetadata
aws configure --profile maap-data-reader set role_session_name DAAC_Direct  # optional

Now, the AWS CLI and AWS SDK will automatically obtain the necessary credentials when using the maap-data-reader profile (or whichever profile name you choose to use above). This also means that the credentials are not only cached (under ~/.aws/cli/cache/), but also that they are automatically refreshed when they expire.
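For reference, those three commands should leave a profile section in the AWS config file (`~/.aws/config` by default). The sketch below only renders the equivalent text with Python's `configparser` and writes nothing to disk; `render_profile` is a hypothetical helper name, not part of any AWS tooling.

```python
import configparser
import io


def render_profile(profile, role_arn, session_name=None):
    """Hypothetical helper: render an assume-role profile as config text."""
    config = configparser.ConfigParser()
    # Named profiles in the AWS config file use a "profile <name>" section.
    section = f"profile {profile}"
    config[section] = {
        "role_arn": role_arn,
        "credential_source": "Ec2InstanceMetadata",
    }
    if session_name:
        config[section]["role_session_name"] = session_name
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(render_profile(
    "maap-data-reader",
    "arn:aws:iam::884094767067:role/maap-data-reader",
    "DAAC_Direct",
))
```

Seeing the section spelled out can help when checking what a given image or workspace already has configured.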

For example, using the CLI:

$ AWS_PROFILE=maap-data-reader aws s3 ls s3://nsidc-cumulus-prod-protected/ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5
2023-08-31 12:29:38  308854414 ATL08_20230416235213_04061911_006_03.h5
2023-08-31 12:30:22    5902731 ATL08_20230416235213_04061911_006_03.h5.dmrpp

Using Python:

$ AWS_PROFILE=maap-data-reader python
>>> import boto3
>>> s3 = boto3.client("s3")
>>> response = s3.list_objects_v2(Bucket="nsidc-cumulus-prod-protected", Prefix="ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5")
>>> contents = response["Contents"]
>>> for item in contents: print(item["Key"])
... 
ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5
ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5.dmrpp
>>> 

@wildintellect
Collaborator Author

Does that mean everything will operate under that profile? Will this cause issues for bucket permissions inside MAAP? When running DPS jobs, will it cause problems interacting with DPS (writing outputs, etc.)?
I think part of the reason to use the SSM approach in Python was that you can apply it to a context as needed, without having to revert your role afterwards (unlike assuming a role in the CLI).

We should also confirm that awscli made it into all images from 3.1.4 onward.

@chuckwondo
Collaborator

I'm not following what you mean about the SSM approach. How do you envision SSM parameters being used?

Regarding use of an AWS profile, we can scope the profile to a particular session, like so:

$ python
>>> import boto3
>>> session = boto3.Session(profile_name="maap-data-reader")
>>> s3 = session.client("s3")
>>> response = s3.list_objects_v2(Bucket="nsidc-cumulus-prod-protected", Prefix="ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5")
>>> contents = response["Contents"]
>>> for item in contents: print(item["Key"])
... 
ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5
ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5.dmrpp
>>> 

Regarding maap-py, I'm going to create an issue for enhancement, where the MAAP initializer can accept both a boto3.Session instance and a requests.Session instance. In both cases, if no session object is supplied, a default session object would be created, and the various places where requests are currently made would be updated to use the relevant session to make requests.

Thus, to leverage the idea shown above, we might then be able to do something like the following (simplified) to download granules from S3. The maap instance would pass its boto3_session instance through to each Result object returned from search_granules, so that the Result.getData method can make use of the session.

maap = MAAP("api.maap-project.org", boto3_session=boto3.Session(profile_name="maap-data-reader"))
granules = maap.search_granules(...)
granules[0].getData()

This can all be done without introducing breaking changes.
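To illustrate the session pass-through idea, here is a rough sketch. Everything in it is hypothetical: `MaapSketch`, `ResultSketch`, `result_for`, and `get_data` are stand-ins invented for this example and are not the maap-py classes or methods. The only real calls assumed are `boto3.Session()`, `requests.Session()`, `session.client("s3")`, and the S3 client's `download_file`.

```python
class ResultSketch:
    """Hypothetical stand-in for a search result (not the maap-py class)."""

    def __init__(self, bucket, key, boto3_session):
        self.bucket = bucket
        self.key = key
        self._boto3_session = boto3_session

    def get_data(self, destination):
        # Reuse the session handed down from the client, so the download
        # runs with whatever profile or role that session was built with.
        s3 = self._boto3_session.client("s3")
        s3.download_file(self.bucket, self.key, destination)
        return destination


class MaapSketch:
    """Hypothetical stand-in for an initializer that accepts sessions."""

    def __init__(self, host, boto3_session=None, requests_session=None):
        if boto3_session is None:
            import boto3  # imported lazily: only needed for the default

            boto3_session = boto3.Session()
        if requests_session is None:
            import requests  # imported lazily: only needed for the default

            requests_session = requests.Session()
        self.host = host
        self.boto3_session = boto3_session
        self.requests_session = requests_session

    def result_for(self, bucket, key):
        # Every result carries the same boto3 session as its parent client.
        return ResultSketch(bucket, key, self.boto3_session)
```

Because both sessions default to freshly created ones, callers that pass nothing would see no change in behavior, which is what keeps the idea non-breaking.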

@wildintellect
Collaborator Author

@chuckwondo we actually already have docs on this. Though we don't show how to pass it to maap-py.
https://docs.maap-project.org/en/latest/technical_tutorials/access/direct_access.html

@chuckwondo
Collaborator

@wildintellect, thanks for the link to the docs. The only downside to that approach is that credentials are not automatically refreshed, so long-running programs might run into errors due to expired credentials.

Several months ago I researched configuring custom boto3/botocore credentials refreshers, which I will be incorporating into maap-py. As part of that work, I'll show how we can tweak the example in the docs so that we get automatically refreshed credentials (based on what I'll do for maap-py, but not depending on those maap-py changes for the doc example).

See MAAP-Project/maap-py#83
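As a generic illustration of the refresh idea (not botocore's own refresh mechanism, and not what the linked maap-py issue implements), a cache can re-fetch credentials shortly before they expire. `RefreshingCredentials` is a hypothetical class; it assumes `fetch` returns a dictionary with a timezone-aware `Expiration` datetime, as in the `Credentials` part of a boto3 `assume_role` response.

```python
from datetime import datetime, timedelta, timezone


class RefreshingCredentials:
    """Hypothetical cache that re-fetches credentials shortly before expiry."""

    def __init__(self, fetch, margin=timedelta(minutes=5), now=None):
        self._fetch = fetch    # callable returning a credentials dictionary
        self._margin = margin  # how long before expiry to refresh
        self._now = now or (lambda: datetime.now(timezone.utc))
        self._current = None

    def get(self):
        # Fetch on first use, or when the cached credentials are within
        # the safety margin of their expiration time.
        if (
            self._current is None
            or self._now() >= self._current["Expiration"] - self._margin
        ):
            self._current = self._fetch()
        return self._current
```

A long-running program would call `get()` each time it needs credentials instead of holding on to one set, so it never has to care how long an individual set lasts.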

@wildintellect
Collaborator Author

Hmm, how long do credentials from our current method last: 1 hour, 12 hours? It would be good to document.

@chuckwondo
Collaborator

It's 1 hour. Here's a response (partially redacted):

{
  "Credentials": {
    "AccessKeyId": "***",
    "SecretAccessKey": "***",
    "SessionToken": "***",
    "Expiration": "2024-03-04T23:40:40+00:00"
  },
  "AssumedRoleUser": {
    "AssumedRoleId": "***:botocore-session-1709592040",
    "Arn": "arn:aws:sts::***:assumed-role/maap-data-reader/botocore-session-1709592040"
  },
  "ResponseMetadata": {
    "RequestId": "f86a9f44-5fd3-4771-9b2f-28cfae125d62",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "f86a9f44-5fd3-4771-9b2f-28cfae125d62",
      "content-type": "text/xml",
      "content-length": "1513",
      "date": "Mon, 04 Mar 2024 22:40:40 GMT"
    },
    "RetryAttempts": 0
  }
}

Specifically, we can see that the expiration is 1 hour after the request time:

  • "Credentials.Expiration": "2024-03-04T23:40:40+00:00"
  • "ResponseMetadata.HTTPHeaders.date": "Mon, 04 Mar 2024 22:40:40 GMT"
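The same comparison can be done in code. This small sketch parses the two timestamp strings exactly as they appear in the response above, using only the standard library; `credential_lifetime` is a hypothetical helper name.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def credential_lifetime(expiration, date_header):
    """Return the time between the response date and credential expiry."""
    expires_at = datetime.fromisoformat(expiration)      # ISO 8601 with offset
    requested_at = parsedate_to_datetime(date_header)    # HTTP date header
    return expires_at - requested_at


lifetime = credential_lifetime(
    "2024-03-04T23:40:40+00:00",
    "Mon, 04 Mar 2024 22:40:40 GMT",
)
print(lifetime)  # prints 1:00:00
```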

@wildintellect
Collaborator Author

wildintellect commented Mar 5, 2024

I looked into this a little, we can increase the duration of these keys up to 12 hours. Would that be helpful to simply reduce the frequency of refreshes needed?

Once the role's maximum session duration has been raised, passing DurationSeconds would extend how long the session stays valid.

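import boto3

sts = boto3.client("sts")
# parameter_value holds the ARN of the role to assume.
# DurationSeconds cannot exceed the role's configured maximum session duration.
assumed_role_object = sts.assume_role(
    RoleArn=parameter_value,
    RoleSessionName='TutorialSession',
    DurationSeconds=43200,  # 12 hours
)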
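If the duration ends up being user-configurable, a small guard could catch out-of-range values before the request is sent. This is only a sketch: `assume_role_kwargs` is a hypothetical helper, and the bounds are the 900-second minimum that STS accepts and the 12-hour ceiling discussed here. STS will still reject a value above the role's own configured maximum.

```python
MIN_SECONDS = 900     # lowest DurationSeconds that STS accepts
MAX_SECONDS = 43200   # 12 hours, the ceiling discussed in this thread


def assume_role_kwargs(role_arn, session_name, hours=1.0):
    """Hypothetical helper: build keyword arguments for sts.assume_role."""
    seconds = int(hours * 3600)
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(
            f"duration must be between {MIN_SECONDS} and {MAX_SECONDS} "
            f"seconds, got {seconds}"
        )
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "DurationSeconds": seconds,
    }
```

The result would be passed along as `sts.assume_role(**assume_role_kwargs(...))`, where `sts` is a boto3 STS client.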

@chuckwondo
Collaborator

> I looked into this a little, we can increase the duration of these keys up to 12 hours. Would that be helpful to simply reduce the frequency of refreshes needed?

That's an option that might suffice, but I still view it as a bit of a band-aid. Ideally, we want auto-refresh to occur so we don't even care (nor worry about) how long individual creds last.

@wildintellect
Collaborator Author

New task: find a way to apply the 12-hour limit to the policy in code.
Then we'll open a new ticket about refreshing tokens as needed.
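One possible starting point, assuming "the policy" here means the role's maximum session duration setting: the boto3 IAM client exposes `update_role` with a `MaxSessionDuration` parameter. The helper name `raise_max_session_duration` is hypothetical, and `iam_client` is expected to be something like `boto3.client("iam")` run with permission to update the role.

```python
TWELVE_HOURS = 12 * 60 * 60  # 43200 seconds


def raise_max_session_duration(iam_client, role_name, seconds=TWELVE_HOURS):
    """Hypothetical helper: set a role's maximum session duration."""
    iam_client.update_role(RoleName=role_name, MaxSessionDuration=seconds)
    # Read the role back to confirm the value that is now in effect.
    role = iam_client.get_role(RoleName=role_name)["Role"]
    return role["MaxSessionDuration"]
```

If the roles are managed through infrastructure-as-code, the same setting would more naturally live there rather than in a one-off script.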

@chuckwondo
Collaborator
