This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Containers that fail with no reason causes a panic #26

Open
sherzberg opened this issue Nov 8, 2019 · 10 comments

Comments

@sherzberg
Contributor

See #25 (comment) for more background.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x895b85]

goroutine 1 [running]:
github.com/buildkite/ecs-run-task/runner.writeContainerFinishedMessage(0xb226e0, 0xc00009c010, 0xc0002161c0, 0xc0001bd9e0, 0xc0001b82a0, 0x3a, 0x0)
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/runner/runner.go:261 +0x155
github.com/buildkite/ecs-run-task/runner.(*Runner).Run(0xc0000ac0e0, 0xb226e0, 0xc00009c010, 0x0, 0x1)
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/runner/runner.go:219 +0x12dd
main.main.func1(0xc0000e8580, 0x0, 0x0)
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/main.go:115 +0x625
github.com/urfave/cli.HandleAction(0x93db80, 0xa17268, 0xc0000e8580, 0x0, 0x0)
	/Users/lachlan/go/pkg/mod/github.com/urfave/cli@v1.20.0/app.go:490 +0xc8
github.com/urfave/cli.(*App).Run(0xc00014cea0, 0xc0000ac000, 0xe, 0xe, 0x0, 0x0)
	/Users/lachlan/go/pkg/mod/github.com/urfave/cli@v1.20.0/app.go:264 +0x57c
main.main()
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/main.go:125 +0x8c3

From the AWS console, this looks like a case where ECS doesn't even get to the point of launching a container, so we might be able to fall back to the ecs.Task.StoppedReason.
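
For illustration, a minimal sketch of that fallback (the finishedReason helper and the surrounding package are hypothetical, not the actual runner.go code): prefer the container-level Reason, and fall back to the task-level StoppedReason when the container never produced one.

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// finishedReason prefers the container-level Reason, and falls back to the
// task-level StoppedReason when ECS never got far enough to populate the
// container fields. The function name is illustrative only.
func finishedReason(task *ecs.Task, container *ecs.Container) string {
	if container != nil && container.Reason != nil {
		return aws.StringValue(container.Reason)
	}
	if task != nil && task.StoppedReason != nil {
		return aws.StringValue(task.StoppedReason)
	}
	return "stopped with no container reason or task stopped reason"
}

func main() {
	// Mirrors the describe-tasks output further down: the container has no
	// Reason or ExitCode, but the task carries a StoppedReason.
	task := &ecs.Task{
		StoppedReason: aws.String("Timeout waiting for network interface provisioning to complete."),
	}
	container := &ecs.Container{Name: aws.String("app")} // Reason and ExitCode are nil
	fmt.Println(finishedReason(task, container))
}

With a task like the one in the describe-tasks output in the next comment, something along these lines would surface the "Timeout waiting for network interface provisioning to complete." message instead of panicking on the missing container fields.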

sherzberg added a commit to sherzberg/ecs-run-task that referenced this issue Nov 8, 2019
@sherzberg
Contributor Author

Here is some more info about this particular panic.

Output from aws ecs describe-tasks --tasks REDACTED --cluster REDACTED

{
	"tasks": [{
		"taskArn": "arn:aws:ecs:us-east-1:REDACTED:task/REDACTED",
		"clusterArn": "arn:aws:ecs:us-east-1:REDACTED:cluster/REDACTED",
		"taskDefinitionArn": "arn:aws:ecs:us-east-1:REDACTED:task-definition/deREDACTED:REDACTED",
		"overrides": {
			"containerOverrides": [{
				"name": "app",
				"command": [
					"REDACTED"
				],
				"environment": [

				]
			}]
		},
		"lastStatus": "STOPPED",
		"desiredStatus": "STOPPED",
		"cpu": "256",
		"memory": "512",
		"containers": [{
			"containerArn": "arn:aws:ecs:us-east-1:REDACTED:container/REDACTED",
			"taskArn": "arn:aws:ecs:us-east-1:REDACTED:task/REDACTED",
			"name": "app",
			"lastStatus": "STOPPED",
			"networkInterfaces": [{
				"attachmentId": "REDACTED",
				"privateIpv4Address": "REDACTED"
			}],
			"healthStatus": "UNKNOWN",
			"cpu": "0"
		}],
		"version": 4,
		"stoppedReason": "Timeout waiting for network interface provisioning to complete.",
		"connectivity": "CONNECTED",
		"connectivityAt": 1573233270.396,
		"createdAt": 1573233066.948,
		"stoppingAt": 1573233252.398,
		"stoppedAt": 1573233282.065,
		"group": "family:REDACTED",
		"launchType": "FARGATE",
		"platformVersion": "1.3.0",
		"attachments": [{
			"id": "REDACTED",
			"type": "ElasticNetworkInterface",
			"status": "DELETED",
			"details": [{
					"name": "subnetId",
					"value": "REDACTED"
				},
				{
					"name": "networkInterfaceId",
					"value": "REDACTED"
				},
				{
					"name": "macAddress",
					"value": "REDACTED"
				},
				{
					"name": "privateIPv4Address",
					"value": "REDACTED"
				}
			]
		}],
		"healthStatus": "UNKNOWN",
		"tags": []
	}],
	"failures": []
}

@Eli-Goldberg
Contributor

Eli-Goldberg commented Feb 16, 2020

Not sure, but it's possible I have a fix - testing it now.

When waiting for the task to finish, the SDK waiter has a built-in max-attempts limit (100 by default) and a built-in delay between polls (6 seconds by default).
So it seems any job that takes more than 10 minutes will get a "ResourceNotReady: exceeded wait attempts".

You can try changing svc.WaitUntilTasksStopped to

err = svc.WaitUntilTasksStoppedWithContext(
	ctx,
	&ecs.DescribeTasksInput{
		Cluster: aws.String(r.Cluster),
		Tasks:   taskARNs,
	},
	// >>>>>>>>>> THESE <<<<<<<<<<
	request.WithWaiterMaxAttempts(1),
	request.WithWaiterDelay(func(attempt int) time.Duration {
		return time.Second * 1
	}),
	// >>>>>>>>>> REPRODUCE THE ERROR <<<<<<<<<<
)

And it will instantly throw the error above.

So it seems the fix would be to expose "delay" and "max-attempts" as configurable parameters,
as well as to use longer default values.
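
For illustration, a sketch of what exposing those parameters might look like (the waitForTasks wrapper, parameter names, and example values are assumptions, not necessarily what a PR would land):

package main

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// waitForTasks polls until the given tasks stop, using caller-supplied waiter
// settings instead of the SDK defaults (6s delay x 100 attempts = 10 minutes).
func waitForTasks(ctx context.Context, svc *ecs.ECS, cluster string, taskARNs []*string, pollDelay time.Duration, maxAttempts int) error {
	return svc.WaitUntilTasksStoppedWithContext(
		ctx,
		&ecs.DescribeTasksInput{
			Cluster: aws.String(cluster),
			Tasks:   taskARNs,
		},
		request.WithWaiterDelay(request.ConstantWaiterDelay(pollDelay)),
		request.WithWaiterMaxAttempts(maxAttempts),
	)
}

func main() {
	sess := session.Must(session.NewSession())
	svc := ecs.New(sess)

	// e.g. poll every 10 seconds for up to 6 hours (2160 attempts).
	err := waitForTasks(context.Background(), svc, "REDACTED",
		[]*string{aws.String("arn:aws:ecs:us-east-1:REDACTED:task/REDACTED")},
		10*time.Second, 2160)
	if err != nil {
		panic(err)
	}
}

Using request.ConstantWaiterDelay keeps the polling behaviour identical to the SDK default, just with tunable values.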

@sherzberg
Contributor Author

@Eli-Goldberg have you tried out the PR I submitted? Does it fix your issue? #27

@lox
Contributor

lox commented Feb 19, 2020

/cc @pda

@daroczig

So it seems any job that takes more than 10 minutes will get a "ResourceNotReady: exceeded wait attempts".

I can confirm that -- seems like there's no way to run an ECS task that takes more than 10 mins :(

@Eli-Goldberg
Contributor

So it seems any job that takes more than 10 minutes will get a "ResourceNotReady: exceeded wait attempts".

I can confirm that -- seems like there's no way to run an ECS task that takes more than 10 mins :(

I have a tested fix, will open a PR today

@dannymidnight

Hey @Eli-Goldberg did you ever land the fix for this issue? I seem to be hitting this issue reasonably frequently.

@Eli-Goldberg
Contributor

Eli-Goldberg commented Jun 2, 2020 via email

@dannymidnight

Awesome! Thanks :)

@Eli-Goldberg
Contributor

I've opened a PR: #35.
