How to troubleshoot Resource Based Policies Scan Failures? #31

iankowsk · 2024-09-10T18:16:33Z

After running the Resource-Based Policies scan for all regions and services, the job ended with the status "Failed".

How to troubleshoot this error?

morjoan · 2024-09-11T19:51:15Z

Hi, please see our guide for troubleshooting that issue here:
https://docs.aws.amazon.com/solutions/latest/account-assessment-for-aws-organizations/troubleshooting.html

iankowsk · 2024-09-16T13:57:27Z

After troubleshooting, we have found the following error:

After adding the following trust policy, the job started with no errors, but it has now been running for more than 48 hours.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

We are unable to take any action.

Not able to delete it.

Not able to restart it.

No Lambda function or Step Functions execution is running.

Questions:

How can we stop/cancel the current job?
How can we fix the error "The principal states.amazonaws.com is not authorized to assume the provided role"?

tbelmega · 2024-09-16T17:48:27Z

Hi @iankowsk ,
which version of the solution are you running?

The "Scan running" message you see on the UI is driven by the "JobStatus" field on the "lastJobMarker" item in the DynamoDB table JobHistory. The "FailJob" step you see on the step function diagram is basically the "catch" clause of the step function; it is supposed to update the JobTable and set the "lastJobMarker" from "RUNNING" to "FAILED". Since this update failed, the solution treats the job as still running and can't start another one.
This is obviously not supposed to happen. The immediate workaround is to manually update the JobStatus field in DynamoDB from "RUNNING" to "FAILED" (see screenshot).

To be clear, no compute is technically "running", so the customer isn't billed for anything. It's just that the database entry representing the job hasn't been updated from "RUNNING" to "FAILED" as it should have been, so the solution logic treats the state as if a job were running and doesn't let you start a new one.
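
For anyone who prefers a script to the console, the manual status change described above could be sketched roughly as follows. This is only an illustration, not part of the solution itself: the function names are hypothetical, the deployed table name and the primary key of the "lastJobMarker" item have to be copied from your own account, and the sketch assumes "JobStatus" is stored as a plain string attribute. The condition makes the update a no-op error if the item is not currently "RUNNING".

```python
def build_mark_failed_update(item_key):
    """Build keyword arguments for a boto3 Table.update_item call.

    item_key is the primary key of the "lastJobMarker" item, exactly as it
    appears in your table (attribute names are NOT known from this thread,
    so the caller has to supply them).
    """
    return {
        "Key": item_key,
        # Alias the attribute name to stay clear of DynamoDB reserved words.
        "UpdateExpression": "SET #s = :failed",
        # Only flip the marker if it is still stuck on RUNNING.
        "ConditionExpression": "#s = :running",
        "ExpressionAttributeNames": {"#s": "JobStatus"},
        "ExpressionAttributeValues": {
            ":failed": "FAILED",
            ":running": "RUNNING",
        },
    }


def mark_job_failed(table_name, item_key):
    """Apply the update. Requires boto3 and AWS credentials for the account.

    table_name is an assumption: pass the actual name of the deployed
    JobHistory table, which may carry a stack-specific prefix or suffix.
    """
    import boto3  # imported here so the builder above has no dependencies

    table = boto3.resource("dynamodb").Table(table_name)
    return table.update_item(**build_mark_failed_update(item_key))
```

If the condition is not met, DynamoDB rejects the write with a conditional check failure, which is a useful signal that the marker was already changed by something else.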

Regarding the actual cause for the failure, I can't tell why it fails to assume the role. Please contact me directly internally, I can help troubleshoot on Thursday or Friday if that helps.
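
One thing that may be worth checking in the meantime, offered as a guess rather than a confirmed cause: the trust policy posted earlier in this thread names only lambda.amazonaws.com, while the error message names states.amazonaws.com. A small helper like the one below (the function name is hypothetical) can tell you whether a given trust policy document allows a particular service principal to call sts:AssumeRole. It is a deliberately simplified check that ignores Deny statements, conditions, and wildcard actions.

```python
import json


def service_can_assume(trust_policy, service):
    """Return True if any Allow statement lets `service` call sts:AssumeRole.

    Simplified on purpose: Deny statements, Condition blocks, and wildcard
    actions such as "sts:*" are not evaluated.
    """
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = stmt.get("Principal", {})
        if not isinstance(principal, dict):  # e.g. Principal: "*"
            continue
        services = principal.get("Service", [])
        if isinstance(services, str):
            services = [services]
        if service in services:
            return True
    return False


# The trust policy as posted earlier in this thread.
THREAD_POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
""")

print(service_can_assume(THREAD_POLICY, "lambda.amazonaws.com"))
print(service_can_assume(THREAD_POLICY, "states.amazonaws.com"))
```

Whether the affected role is actually meant to be assumed by Step Functions is something only the deployed stack can confirm, so treat a mismatch here as a lead to investigate, not a diagnosis.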

iankowsk · 2024-09-16T18:42:30Z

Thanks for the quick reply, @tbelmega.

I will reach out to you to discuss how we can solve the role issue.

tbelmega · 2024-09-18T16:37:42Z

Scheduled a troubleshooting session internally, closing this ticket.

tbelmega self-assigned this Sep 16, 2024

dadmukta added the "help wanted" label Sep 18, 2024

tbelmega closed this as completed Sep 18, 2024
