This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Containers that fail with no reason causes a panic #26

Open
sherzberg opened this issue Nov 8, 2019 · 10 comments

Comments

@sherzberg
Contributor

See #25 (comment) for more background.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x895b85]

goroutine 1 [running]:
github.com/buildkite/ecs-run-task/runner.writeContainerFinishedMessage(0xb226e0, 0xc00009c010, 0xc0002161c0, 0xc0001bd9e0, 0xc0001b82a0, 0x3a, 0x0)
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/runner/runner.go:261 +0x155
github.com/buildkite/ecs-run-task/runner.(*Runner).Run(0xc0000ac0e0, 0xb226e0, 0xc00009c010, 0x0, 0x1)
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/runner/runner.go:219 +0x12dd
main.main.func1(0xc0000e8580, 0x0, 0x0)
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/main.go:115 +0x625
github.com/urfave/cli.HandleAction(0x93db80, 0xa17268, 0xc0000e8580, 0x0, 0x0)
	/Users/lachlan/go/pkg/mod/github.com/urfave/cli@v1.20.0/app.go:490 +0xc8
github.com/urfave/cli.(*App).Run(0xc00014cea0, 0xc0000ac000, 0xe, 0xe, 0x0, 0x0)
	/Users/lachlan/go/pkg/mod/github.com/urfave/cli@v1.20.0/app.go:264 +0x57c
main.main()
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/main.go:125 +0x8c3

From the AWS console, this looks like a case where ECS doesn't even get to the point of launching a container, so we might be able to fall back to the ecs.Task.StoppedReason.
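
For illustration, a minimal sketch of that fallback (the finishedReason helper and the surrounding package are hypothetical, not the actual runner.go code): prefer the container-level Reason, and fall back to the task-level StoppedReason when the container never produced one.

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// finishedReason prefers the container-level Reason, and falls back to the
// task-level StoppedReason when ECS never got far enough to populate the
// container fields. The function name is illustrative only.
func finishedReason(task *ecs.Task, container *ecs.Container) string {
	if container != nil && container.Reason != nil {
		return aws.StringValue(container.Reason)
	}
	if task != nil && task.StoppedReason != nil {
		return aws.StringValue(task.StoppedReason)
	}
	return "stopped with no container reason or task stopped reason"
}

func main() {
	// Mirrors the describe-tasks output further down: the container has no
	// Reason or ExitCode, but the task carries a StoppedReason.
	task := &ecs.Task{
		StoppedReason: aws.String("Timeout waiting for network interface provisioning to complete."),
	}
	container := &ecs.Container{Name: aws.String("app")} // Reason and ExitCode are nil
	fmt.Println(finishedReason(task, container))
}

With a task like the one in the describe-tasks output in the next comment, something along these lines would surface the "Timeout waiting for network interface provisioning to complete." message instead of panicking on the missing container fields.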

sherzberg added a commit to sherzberg/ecs-run-task that referenced this issue Nov 8, 2019
@sherzberg
Contributor Author

Here is some more info about this particular panic.

Output from aws ecs describe-tasks --tasks REDACTED --cluster REDACTED

{
	"tasks": [{
		"taskArn": "arn:aws:ecs:us-east-1:REDACTED:task/REDACTED",
		"clusterArn": "arn:aws:ecs:us-east-1:REDACTED:cluster/REDACTED",
		"taskDefinitionArn": "arn:aws:ecs:us-east-1:REDACTED:task-definition/deREDACTED:REDACTED",
		"overrides": {
			"containerOverrides": [{
				"name": "app",
				"command": [
					"REDACTED"
				],
				"environment": [

				]
			}]
		},
		"lastStatus": "STOPPED",
		"desiredStatus": "STOPPED",
		"cpu": "256",
		"memory": "512",
		"containers": [{
			"containerArn": "arn:aws:ecs:us-east-1:REDACTED:container/REDACTED",
			"taskArn": "arn:aws:ecs:us-east-1:REDACTED:task/REDACTED",
			"name": "app",
			"lastStatus": "STOPPED",
			"networkInterfaces": [{
				"attachmentId": "REDACTED",
				"privateIpv4Address": "REDACTED"
			}],
			"healthStatus": "UNKNOWN",
			"cpu": "0"
		}],
		"version": 4,
		"stoppedReason": "Timeout waiting for network interface provisioning to complete.",
		"connectivity": "CONNECTED",
		"connectivityAt": 1573233270.396,
		"createdAt": 1573233066.948,
		"stoppingAt": 1573233252.398,
		"stoppedAt": 1573233282.065,
		"group": "family:REDACTED",
		"launchType": "FARGATE",
		"platformVersion": "1.3.0",
		"attachments": [{
			"id": "REDACTED",
			"type": "ElasticNetworkInterface",
			"status": "DELETED",
			"details": [{
					"name": "subnetId",
					"value": "REDACTED"
				},
				{
					"name": "networkInterfaceId",
					"value": "REDACTED"
				},
				{
					"name": "macAddress",
					"value": "REDACTED"
				},
				{
					"name": "privateIPv4Address",
					"value": "REDACTED"
				}
			]
		}],
		"healthStatus": "UNKNOWN",
		"tags": []
	}],
	"failures": []
}

@Eli-Goldberg
Contributor

Eli-Goldberg commented Feb 16, 2020

Not sure, but it's possible I have a fix - testing it now.

When waiting for the task to finish, the SDK waiter has a built-in max-attempts limit (100 by default) and a built-in delay between polls (6 seconds by default).
So it seems any job that takes more than 10 minutes will get a "ResourceNotReady: exceeded wait attempts".

You can try changing svc.WaitUntilTasksStopped to

err = svc.WaitUntilTasksStoppedWithContext(
	ctx,
	&ecs.DescribeTasksInput{
		Cluster: aws.String(r.Cluster),
		Tasks:   taskARNs,
	},
	// >>>>>>>>>> THESE <<<<<<<<<<
	request.WithWaiterMaxAttempts(1),
	request.WithWaiterDelay(func(attempt int) time.Duration {
		return time.Second * 1
	}),
	// >>>>>>>>>> REPRODUCE THE ERROR <<<<<<<<<<
)

And it will instantly throw the error above.

So it seems the fix would be to expose "delay" and "max-attempts" as configurable parameters,
as well as to use longer default values.
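
For illustration, a sketch of what exposing those parameters might look like (the waitForTasks wrapper, parameter names, and example values are assumptions, not necessarily what a PR would land):

package main

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// waitForTasks polls until the given tasks stop, using caller-supplied waiter
// settings instead of the SDK defaults (6s delay x 100 attempts = 10 minutes).
func waitForTasks(ctx context.Context, svc *ecs.ECS, cluster string, taskARNs []*string, pollDelay time.Duration, maxAttempts int) error {
	return svc.WaitUntilTasksStoppedWithContext(
		ctx,
		&ecs.DescribeTasksInput{
			Cluster: aws.String(cluster),
			Tasks:   taskARNs,
		},
		request.WithWaiterDelay(request.ConstantWaiterDelay(pollDelay)),
		request.WithWaiterMaxAttempts(maxAttempts),
	)
}

func main() {
	sess := session.Must(session.NewSession())
	svc := ecs.New(sess)

	// e.g. poll every 10 seconds for up to 6 hours (2160 attempts).
	err := waitForTasks(context.Background(), svc, "REDACTED",
		[]*string{aws.String("arn:aws:ecs:us-east-1:REDACTED:task/REDACTED")},
		10*time.Second, 2160)
	if err != nil {
		panic(err)
	}
}

Using request.ConstantWaiterDelay keeps the polling behaviour identical to the SDK default, just with tunable values.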

@sherzberg
Contributor Author

@Eli-Goldberg have you tried out the PR I submitted? Does it fix your issue? #27

@lox
Contributor

lox commented Feb 19, 2020

/cc @pda

@daroczig

So it seems any job that takes more than 10 minutes will get a "ResourceNotReady: exceeded wait attempts".

I can confirm that -- seems like there's no way to run an ECS task that takes more than 10 mins :(

@Eli-Goldberg
Contributor

So it seems any job that takes more than 10 minutes will get a "ResourceNotReady: exceeded wait attempts".

I can confirm that -- seems like there's no way to run an ECS task that takes more than 10 mins :(

I have a tested fix, will open a PR today

@dannymidnight

Hey @Eli-Goldberg did you ever land the fix for this issue? I seem to be hitting this issue reasonably frequently.

@Eli-Goldberg
Contributor

Eli-Goldberg commented Jun 2, 2020 via email

@dannymidnight

Awesome! Thanks :)

@Eli-Goldberg
Contributor

I've opened a PR: #35.
