Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consul connect is extremely unreliable at startup #9307

Closed
kneufeld opened this issue Nov 10, 2020 · 35 comments · Fixed by #23381
Closed

consul connect is extremely unreliable at startup #9307

kneufeld opened this issue Nov 10, 2020 · 35 comments · Fixed by #23381
Assignees
Labels
Milestone

Comments

@kneufeld
Copy link
Contributor

I've just had multiple multi-hour outages in the past two days because consul connect proxies would not start.

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1

That's it, no logs because it never actually started. Granted this is a complicated job with 5 groups and proxies between them but it used to work. It seems as soon as I had multiple jobs (built from a template) the proxies started to act up. Proxy names are all unique based on job name and group.

All I can do it just repeatedly hit restart until the gods smile on me and the proxy finally starts.

Nomad version

Started at 0.11.6 and desperately tried to upgrade to 0.12.7 to no avail.

Nomad v0.12.7 (6147cb578794cb2d0c35d68fe1791728a09bb081)

Operating system and Environment details

Ubuntu 18.04 and Ubuntu 20.04

Reproduction steps

Make a job with multiple groups and multiple connect sidecar between them (but not circular). The first instance of that job will probably work. Make 10 more.

After investing a lot of time and energy in nomad I'm afraid I'll have drop it for I have no idea what I'm going to do...

@shoenig
Copy link
Member

shoenig commented Nov 10, 2020

Hi @kneufeld sorry you're having trouble with this.

Seeing the exit status 1 error message for creating bootstrap configuration indicates something went wrong when Nomad execs consul to retrieve the envoy bootstrap configuration from Consul. Did you happen to capture Consul logs while this is happening, which may shed some light into what's going on? What version of Consul are you using?

@kneufeld
Copy link
Contributor Author

kneufeld commented Nov 10, 2020

Nov 10 08:54:21 clients-nomad1.example.com consul[6571]:     2020-11-10T08:54:21.287-0800 [ERROR] agent.proxycfg: watch error: service_id=_nomad-task-bb39ddd4-951b-52d2-12a7-8bb290adebef-group-listener-     <redacted>-proxy-redis--sidecar-proxy id=service-http-checks:_nomad-task-bb39ddd4-951b-52d2-12a7-8bb290adebef-group-listener-<redacted>-proxy-redis- error="error filling agent cache: Internal cache failure: service '_nomad-task-bb39ddd4-951b-52d2-12a7-8bb290adebef-group-listener-<redacted>-proxy-redis-' not in agent state"

This is the only thing I can find that looks interesting.

# consul version
Consul v1.8.5
Revision 1e03567d3
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

I'd say there is a race condition between groups (I'm actually pretty positive there is) but this error happens with my redis task which is at the "bottom" on which everything else has a proxy connect to. ie: redis is the "server" proxy and does not have any "client" proxy

@shoenig
Copy link
Member

shoenig commented Nov 10, 2020

Can you take a look at the Nomad agent logs when this happens? There should be a log line with the stderr output from exec-ing consul binary with relevent error message. For example something like,

2020-11-10T14:40:33.171-0600 [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=8df94e24-253c-5fed-acde-63b42af168a8 task=connect-proxy-count-api error="exit status 1" stderr="This error message goes to stderr!"

@kneufeld
Copy link
Contributor Author

Found these lines

Nov 10 08:56:01 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:01.763-0800 [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-usher error="exit status 1" stderr="==> No sidecar proxy registered for _nomad-task-ad06b594-d5b0-331c-6c3f-4843b2070eb8-group-housekeeper-<redacted>-centre-proxy-usher-
Nov 10 08:56:01 clients-nomad1.west.example.com nomad[23271]: "
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.161-0800 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-redis error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.161-0800 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-redis reason="Restart within policy" delay=11.306497887s
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.584-0800 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-usher error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.585-0800 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-usher reason="Restart within policy" delay=11.306497887s

@shoenig
Copy link
Member

shoenig commented Nov 16, 2020

Hi @kneufeld , the -beta3 release contains #9330 which may help alleviate the issue, could you give that a try? Basically Nomad since 0.12 has been re-registering services with connect sidecars any time it needed to reconcile with Consul. On a busy cluster, this could mean Nomad is doing so continuously, possibly causing enough delays in registration of new services to cause the symptoms you are seeing.

@kneufeld
Copy link
Contributor Author

kneufeld commented Nov 16, 2020

that's v1.0-beta3? And I assume servers and clients both need to be upgraded?

Is there a way to restart nomad on a client without restarting the docker containers? I don't want to restart nomad if that will cause the containers to migrate because that will cause an outage on any machine not running v1. And that's assuming the problem is fixed.

Thanks for the quick turn around time and follow up.

@kneufeld
Copy link
Contributor Author

kneufeld commented Nov 20, 2020

I migrated my clients off of nomad and into their own vms and started with docker-compose.

This allowed me to upgrade my west datacenter to to v1.0-beta3. My east dc still has clients on 0.12 but the servers over there have been upgraded.

First tests looked good but then I started getting the same bootstrap errors. I may have found an other piece of the puzzle

Evaluations area insta failing.

nomad plan job.nomad
nomad job run -check-index 0 job.nomad

Then in hashi-ui I see the following instantly:

status: failed
statusdesc: maximum attempts reached (5)
triggeredby: job-registered

followed by a never ending stream of:

status: failed
statusdesc: maximum attempts reached (5)
triggeredby: max-plan-attempts

and

Status:    blocked
StatusDescription:     created due to placement conflicts
TriggeredBy:    max-plan-attempts

but I kinda call BS on placement conflicts since the original nomad plan passed and I have more nodes in my west dc than tasks and no constraints in my job file.

Is it possible that my cluster is just corrupt somehow? If so then how can nomad be trusted?

My final hope is to migrate the rest of my clients off nomad in the east dc and re-bootstrap the entire cluster. I have zero doubts that will work at first but long term?

ps. I've nomad stop -purge several times

@kneufeld
Copy link
Contributor Author

@shoenig I got the rest of my jobs off the cluster and rebuilt it. My quick test works, tasks scale etc.

I really don't know what to think...

@shoenig
Copy link
Member

shoenig commented Nov 23, 2020

Hi @kneufeld sorry this has caused so much trouble. The Nomad v0.12 release was centered around major underlying network architecture changes, and we've been squashing followup bugs in the point releases since. Though we hope for a smooth transition, the changes were made before Nomad v1.0 specifically so we could provide the stability guarantees of a 1.0 release going forward.

It sounds like the rebuilt cluster is operating okay?

I suspect the evaluation failures may have been caused by the problem described in #9356. The fix for that isn't present in the -beta3, but should be coming in a release soon.

@kneufeld
Copy link
Contributor Author

@shoenig I'm only running a single job but I never saw any issues until I tried to run many jobs. I do think the cluster was somehow corrupted which is it's own very troubling thought.

I'm opening another bug that I saw pre-rebuild and post-rebuild with only a single job running.

@kneufeld
Copy link
Contributor Author

kneufeld commented Dec 1, 2020

@shoenig I just saw the same insta-blocked deployments/evaulations with only a single job in the cluster. I shutdown the leader to force an election and once a new leader was chosen I could successfully deploy a job.

This was after stopping all servers and clients, deleting /var/lib/nomad. Aka, I rebuilt a brand new cluster. Could there be some lingering schmoo in consul that's breaking everything?

@kneufeld
Copy link
Contributor Author

kneufeld commented Dec 1, 2020

@shoenig I just received the same

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1

error after starting 3 jobs in my cluster. This bug is not remotely fixed. I emailed my job file in issue #9431

@Legogris
Copy link

Legogris commented Jan 20, 2021

I'm getting the above behavior when attempting to use Connect on an existing cluster (Consul 1.9.1, Nomad 1.0.2, ACLs and TLS enabled). No errors in log about bootstrap error for Connect though, so maybe it's a similar manifestation of something else.

FWIW I can't get Connenct to work at all (using the Counter Dashboard example) even when making sure node is warm before submitting a single job. Nothing stands out in logs for consul or nomad, or the envoy containers, apart from Using deprecated option ... warnings in Envoy (tried with 1.13,1.14,1.15).

@holtwilkins
Copy link
Contributor

I've seen this error in particular alongside the generated Nomad ACL tokens. Namely, what seems to happen is that as part of creating a new allocation, Nomad spins-up a service Consul token for the sidecar, and then immediately tries to use it. This token hasn't propagated quite yet into Consul, so I see in the Nomad logs some 403 ACL token not found errors, then eventually when the alloc keeps retrying, as it keeps using the same service token, it eventually becomes valid and the allocation succeeds.

So, it seems like the sidecar needs to wait a little bit more or have some retries built-in at an earlier phase rather than failing the whole allocation immediately and relying on the general allocation restart behaviour to sort it out.

@Legogris
Copy link

Sounds different from what I am experiencing here (though I think I've absolutely seen what you mention as well before).

All checks passing, services considered healthy, no errors in either consul or nomad

@shevagit
Copy link

We're getting the same error, without using ACL tokens, connect fails sometimes causing the tasks to fail.

"envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"

@betterthanbreakfast
Copy link

Hello, small follow-up:
We've got nomad 1.0.4 with TLS, but without ACL's atm.
We've run into same issue, job fails during startup with "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"

In our case, reason was in service name, name contains UPPERCASE characters. After transform em to lowercase, problem was solved.

@Legogris
Copy link

Legogris commented Apr 22, 2021

Got hit by this again now when updating an existing job (the only thing changed: no-op change inside a generated template)

Stopping the job and starting it again:

[ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=foobar task=connect-proxy-deadbeef error="exit status 1" stderr="==> No proxy ID specified. One of -proxy-id, -sidecar-for, or -gateway is required
[ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=foobar task=connect-proxy-deadbeef error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"

Tried the following, does not resolve it:

  • Stopping the job
  • Draining the client
  • Stopping nomad on the client
  • Doing a full reboot of the client instance
  • doing a GC (making sure the stopped job is forgotten)
  • Starting nomad on the client again
  • Scheduling a new instance of the job

Same sidecar proxy (there are 4 in total on the group) keeps failing with the same errors.

Consul logs around the same time:

[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.35:8300 error="rpc error making call: rpc error making call: Permission denied"
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.34:8300 error="rpc error making call: Permission denied"
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.33:8300 error="rpc error making call: rpc error making call: Permission denied"
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.35:8300 error="rpc error making call: rpc error making call: Permission denied"
[ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-foobar-group-deadbeef-1234-sidecar-proxy id=service-http-checks:_nomad-task-foobar-group-deadbeef-1234 error="error filling agent cache: Internal cache failure: service '_nomad-task-foobar-group-deadbeef-1234' not in agent state"
[ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-foobar-group-deadbeef-1234-sidecar-proxy id=service-http-checks:_nomad-task-foobar-group-deadbeef-1234 error="error filling agent cache: Internal cache failure: service '_nomad-task-foobar-group-deadbeef-1234' not in agent state"

@Legogris
Copy link

Legogris commented Apr 22, 2021

Stopped all other jobs related to the failing sidecar container and performed a rolling restart of all consul servers and nomad servers and rescheduled - still same.

Observation: When draining the node, the 3 other sidecar tasks for the group with the failing one are stuck in running despite the main tasks being successfully killed. First time I see a drain taking more than a few seconds. Only doing a manual stop successfully stops the job fully.

In the "zombie" sidecar tasks I see in stderr after the drain (apart from all the deprecation warnings):

[info][upstream] [source/common/upstream/cds_api_impl.cc:64] cds: add 1 cluster(s), remove 1 cluster(s)

@Legogris
Copy link

@Legogris
Copy link

Legogris commented Apr 22, 2021

Arbitrarily changing the service name (amending intentions accordingly) made it come back up again. So seems to be the exact same thing @flytocolors was having (the casing for them is most likely coincidental though as that was not a factor here)

@shoenig
Copy link
Member

shoenig commented Apr 26, 2021

@Legogris are you using Ingress Gateways, and on which version of Nomad? We fixed a bug that would trigger that error in Nomad v1.0.4. If we're still running into that error with just normal sidecars we can investigate further. In particular this log line (with DEBUG enabled) should shed some light on what's happening.

@watsonian watsonian added the hcc/cst Admin - internal label Sep 22, 2021
@jhitt25
Copy link

jhitt25 commented Feb 21, 2022

Not to "me too" an issue, but im getting this sporadically as well. We're on consul 1.11.2 and nomad 1.2.4, using ACLs for both. Our cluster is running just fine and i can even stop/start/restart existing jobs, but I have a new job that is refusing to start right now. This job has been running off and on for the past week as i tweak things for our environment, but as of this morning it is consistently failing with this exact scenario.

We're testing a 2 node keycloak cluster in Nomad/Consul Connect with a Treafik frontend. The goal is to have 3 ports - service, health check, and jgroups (for the JBoss clustering). The jgroups port is outside connect, and the health check is a manual 'expose' because it runs on a different port than the service. This has worked fine for going on 2 weeks, but this morning started in with the same issue described here. I don't know if some facet of what we're doing adds any new information, but i figured it couldn't hurt.

I've attached our job file along with the excerpts of the consul and nomad logs as well if it can help.

nomad-job.txt
consul-log.txt
nomad-job.txt

@jhitt25
Copy link

jhitt25 commented Feb 22, 2022

As an update to my addition, i was able to get past the issue by moving my non-connect'ed service (for jgroups) from the group level into the task level.

@schemmerling
Copy link

schemmerling commented Mar 19, 2022

Well, I will intentionally leave a "me too" here - Note the following error message:
Mar 19 16:51:28 TEST sh[17233]: 2022-03-19T16:51:28.860-0600 [ERROR] agent.proxycfg: Failed to handle update from watch: kind=connect-proxy proxy=global service_id=global id=service-http-checks:hello error="error filling agent cache: Internal cache failure: service 'hello' not in agent state"
Consul v1.11.4

@tgross
Copy link
Member

tgross commented May 10, 2024

Leaving a note here based on a conversation I had with @schmichael and some folks from Consul recently. We may be running into the issue fixed for the Consul K8s plugin here: hashicorp/consul-k8s#887 Effectively we're creating the token and then the agent may have a stale view of the world when we use that token. Unfortunately this PR isn't something we can just drop into the Nomad code base because the architecture is a good bit different, but that's a promising lead for when this issue gets revisited next.

@tgross tgross self-assigned this Jun 14, 2024
@tgross
Copy link
Member

tgross commented Jun 14, 2024

I've got an early prototype branch b-consul-token-read-self-on-login that covers the approach described above. I'll be working on getting that completed to cover the legacy workflow next, and then figuring out a way that we can validate that the fix is useful.

Internal ref: https://hashicorp.atlassian.net/browse/NET-10051

tgross added a commit that referenced this issue Jun 20, 2024
Nomad creates a Consul ACL token for each service for registering it in Consul
or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always
talks to the local Consul agent and never directly to the Consul servers. But
the local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with a token that has not yet replicated to the follower that
the local client is connected to. This request gets a 404 on the ACL token and
that negative entry gets cached, preventing any retries from succeeding.

To workaround this, we'll use a method described by our friends over on
`consul-k8s` where after creating the service token we try to read the token
from the local agent in stale consistency mode (which prevents a failed read
from being cached). This cannot completely eliminate this source of error
because it's possible that Consul cluster replication is unhealthy at the time
we need it, but this should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we login via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We've added the timeouts to be configurable via node metadata rather
than the usual static configuration because for most cases, users should not
need to touch or even know these values are configurable; the configuration is
mostly available for testing.

Fixes: #9307
Fixes: #20516
Fixes: #10451

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
tgross added a commit that referenced this issue Jun 21, 2024
Nomad creates a Consul ACL token for each service for registering it in Consul
or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always
talks to the local Consul agent and never directly to the Consul servers. But
the local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with a token that has not yet replicated to the follower that
the local client is connected to. This request gets a 404 on the ACL token and
that negative entry gets cached, preventing any retries from succeeding.

To workaround this, we'll use a method described by our friends over on
`consul-k8s` where after creating the service token we try to read the token
from the local agent in stale consistency mode (which prevents a failed read
from being cached). This cannot completely eliminate this source of error
because it's possible that Consul cluster replication is unhealthy at the time
we need it, but this should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we login via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We've added the timeouts to be configurable via node metadata rather
than the usual static configuration because for most cases, users should not
need to touch or even know these values are configurable; the configuration is
mostly available for testing.

Fixes: #9307
Fixes: #20516
Fixes: #10451

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
@tgross tgross moved this to In Progress in Nomad - Community Issues Triage Jun 24, 2024
tgross added a commit that referenced this issue Jun 26, 2024
Nomad creates a Consul ACL token for each service for registering it in Consul
or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always
talks to the local Consul agent and never directly to the Consul servers. But
the local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with a token that has not yet replicated to the follower that
the local client is connected to. This request gets a 404 on the ACL token and
that negative entry gets cached, preventing any retries from succeeding.

To workaround this, we'll use a method described by our friends over on
`consul-k8s` where after creating the service token we try to read the token
from the local agent in stale consistency mode (which prevents a failed read
from being cached). This cannot completely eliminate this source of error
because it's possible that Consul cluster replication is unhealthy at the time
we need it, but this should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we login via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We've added the timeouts to be configurable via node metadata rather
than the usual static configuration because for most cases, users should not
need to touch or even know these values are configurable; the configuration is
mostly available for testing.

Fixes: #9307
Fixes: #20516
Fixes: #10451

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
@tgross tgross closed this as completed in df67e74 Jun 27, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Jun 27, 2024
@tgross
Copy link
Member

tgross commented Jun 27, 2024

I've just merged #23381 which hopefully should close out this issue. That's planned for release in Nomad 1.8.2 (with backports to Nomad Enterprise 1.7.x and 1.6.x). Once you've deployed that, if you're still running into issues with Envoy bootstrap, please report a new issue after having gone through the troubleshooting guides in Nomad's Service Mesh troubleshooting and Resolving Common Errors in Envoy Proxy. That'll help us separate out additional problems above-and-beyond the ones we've identified here.

@tgross tgross added this to the 1.8.2 milestone Jun 27, 2024
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
Development

Successfully merging a pull request may close this issue.