How to troubleshoot Resource Based Policies Scan Failures? #31

Closed
iankowsk opened this issue Sep 10, 2024 · 5 comments

@iankowsk

After running the Resource-Based Policies scan for all regions and services, the job ended with the status "Failed".

How to troubleshoot this error?

@morjoan
Member

morjoan commented Sep 11, 2024

Hi, please see our guide for troubleshooting that issue here:
https://docs.aws.amazon.com/solutions/latest/account-assessment-for-aws-organizations/troubleshooting.html

@iankowsk
Author

After troubleshooting, we have found the following error:

Screenshot 2024-09-16 at 10 43 14

After adding the following trust policy, the job started with no errors, but it has now been running for more than 48 hours.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

We are unable to take any action.

Screenshot 2024-09-16 at 10 47 42

Not able to delete it.

Screenshot 2024-09-16 at 10 54 27

Not able to restart it.

Screenshot 2024-09-16 at 10 48 44

There are no Lambda functions or Step Functions executions running.

Screenshot 2024-09-16 at 10 50 34

Questions:

  1. How can we stop/cancel the current job?
  2. How can we fix the error "The principal states.amazonaws.com is not authorized to assume the provided role"?

@tbelmega
Member

Hi @iankowsk, which version of the solution are you running?

The "Scan running" message you see in the UI is driven by the "JobStatus" field of the "lastJobMarker" item in the DynamoDB table JobHistory. The "FailJob" step you see in the step function diagram is essentially the step function's "catch" clause: it is supposed to update the JobTable and set the "lastJobMarker" from "RUNNING" to "FAILED". Since this update failed, the solution treats the job as still running and can't start another one.
This is obviously not supposed to happen. The immediate workaround is to manually update the JobStatus field in DynamoDB from "RUNNING" to "FAILED" (see screenshot).

image

To be clear, no compute is technically "running", so the customer isn't billed for anything. It's just that the database entry representing the Job hasn't been updated from "RUNNING" to "FAILED" as it should have, so the solution logic treats the state as if a job was running and doesn't let you start a new one.
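
The manual update described above can also be scripted. The following is only a sketch under assumptions: the deployed table name and the key attributes of the "lastJobMarker" item are not given in this thread, so they appear as placeholders that you would read from the item in the DynamoDB console. The helper names are hypothetical. The condition expression makes the write a no-op failure if the item is no longer marked "RUNNING".

```python
from typing import Any, Dict


def build_mark_failed_request(key: Dict[str, Any]) -> Dict[str, Any]:
    """Build update_item arguments that flip JobStatus from RUNNING to FAILED."""
    return {
        "Key": key,
        "UpdateExpression": "SET #status = :failed",
        # Only apply the change if the item is still marked RUNNING.
        "ConditionExpression": "#status = :running",
        "ExpressionAttributeNames": {"#status": "JobStatus"},
        "ExpressionAttributeValues": {":failed": "FAILED", ":running": "RUNNING"},
        "ReturnValues": "UPDATED_NEW",
    }


def mark_job_failed(table: Any, key: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the update through a boto3 DynamoDB Table resource, or any
    object exposing a compatible update_item method."""
    return table.update_item(**build_mark_failed_request(key))


# Example usage (not executed here; table name and key are placeholders):
#   import boto3
#   table = boto3.resource("dynamodb").Table("<your JobHistory table name>")
#   mark_job_failed(table, {"<key attribute name>": "<lastJobMarker key value>"})
```

Passing the table in as an argument keeps the request-building logic checkable without AWS credentials; editing the field by hand in the console achieves the same result.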

Regarding the actual cause for the failure, I can't tell why it fails to assume the role. Please contact me directly internally, I can help troubleshoot on Thursday or Friday if that helps.
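
One thing that may be worth checking before that session, offered as an assumption and not a confirmed diagnosis: the error names the principal states.amazonaws.com, while the trust policy posted earlier in this thread lists only lambda.amazonaws.com. Which role the state machine actually uses is not stated here. The sketch below uses a hypothetical helper, `allowed_service_principals`, to list the service principals that a trust policy document allows for `sts:AssumeRole`; it deliberately ignores `Deny` statements and `Condition` blocks.

```python
from typing import Any, Dict, Set


def _as_list(value: Any) -> list:
    """IAM policy fields may hold a single value or a list of values."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def allowed_service_principals(trust_policy: Dict[str, Any]) -> Set[str]:
    """Collect service principals allowed to call sts:AssumeRole.

    Simplification: Deny statements and Condition blocks are not evaluated.
    """
    allowed: Set[str] = set()
    for statement in _as_list(trust_policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action"))
        if not any(a in ("sts:AssumeRole", "sts:*", "*") for a in actions):
            continue
        principal = statement.get("Principal")
        if isinstance(principal, dict):
            allowed.update(_as_list(principal.get("Service")))
    return allowed


# The trust policy as posted earlier in this thread.
posted_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

services = allowed_service_principals(posted_policy)
print(sorted(services))                     # ['lambda.amazonaws.com']
print("states.amazonaws.com" in services)   # False
```

If the role in the error turns out to be the one this policy is attached to, the absence of states.amazonaws.com would be consistent with the message, but that still needs to be verified against the deployed resources.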

@tbelmega tbelmega self-assigned this Sep 16, 2024
@iankowsk
Author

Thanks for the quick reply, @tbelmega.

I will reach out to you to discuss how we can solve the role issue.

@dadmukta dadmukta added the help wanted Extra attention is needed label Sep 18, 2024
@tbelmega
Member

Scheduled a troubleshooting session internally, closing this ticket.
