
Charm temporarily goes to Error during deployment while handling config-changed, but eventually proceeds to Active #337

Closed
ca-scribner opened this issue Nov 17, 2023 · 14 comments

@ca-scribner

Steps to reproduce

This might be reproducible in a smaller test suite, but I've only encountered it during the kfp-api charm's integration tests (tox -e integration -- --model kubeflow). I tried reproducing it by deploying only mysql-k8s, but did not see the same issue. Maybe there's a CPU-load-driven race condition?

Expected behavior

The charm progresses to Active without encountering an Error state.

Actual behavior

The charm hits an Error state during a config-changed hook while deploying. Over time, this error goes away and the charm proceeds to Active.

While the charm does eventually reach Active, going into Error on the way suggests there's a foreseeable failure that is uncaught. Plus, in my use case, it resulted in flaky CI, because we typically treat any Error state as a test failure and end the test right away.
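For context, this is roughly the kind of check that makes any transient Error state fatal in CI (a sketch using pytest-operator / python-libjuju; application names and the timeout are illustrative, not the actual test code):

```python
# Sketch: a transient Error state ends the run immediately because
# wait_for_idle raises as soon as any unit reports "error".
async def test_deployment_settles(ops_test):
    await ops_test.model.wait_for_idle(
        apps=["kfp-api", "mysql-k8s"],
        status="active",
        raise_on_error=True,  # the default: any unit in error aborts the wait
        timeout=1800,
    )
```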

Versions

Operating system:

Juju CLI: 3.1.6

Juju agent: 3.1.6

Charm revision: 8.0/edge

microk8s: 1.25-strict

Log output

kubectl pod logs from a local run:

```
kubectl logs mysql-k8s-0 -c charm
2023-11-17T20:40:57.676Z [pebble] HTTP API server listening on ":38812".
2023-11-17T20:40:57.676Z [pebble] Started daemon.
2023-11-17T20:40:57.682Z [pebble] POST /v1/services 4.454317ms 202
2023-11-17T20:40:57.682Z [pebble] Started default services with change 1.
2023-11-17T20:40:57.685Z [pebble] Service "container-agent" starting: /charm/bin/containeragent unit --data-dir /var/lib/juju --append-env "PATH=$PATH:/charm/bin" --show-log --charm-modified-version 0
2023-11-17T20:40:57.746Z [container-agent] 2023-11-17 20:40:57 INFO juju.cmd supercommand.go:56 running containerAgent [3.1.6 f6a66aa91eec620f5ac04a19d8c06bef03ae6228 gc go1.20.8]
2023-11-17T20:40:57.747Z [container-agent] starting containeragent unit command
2023-11-17T20:40:57.747Z [container-agent] containeragent unit "unit-mysql-k8s-0" start (3.1.6 [gc])
2023-11-17T20:40:57.747Z [container-agent] 2023-11-17 20:40:57 INFO juju.cmd.containeragent.unit runner.go:578 start "unit"
2023-11-17T20:40:57.747Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.upgradesteps worker.go:60 upgrade steps for 3.1.6 have already been run.
2023-11-17T20:40:57.749Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.probehttpserver server.go:157 starting http server on 127.0.0.1:65301
2023-11-17T20:40:57.775Z [container-agent] 2023-11-17 20:40:57 INFO juju.api apiclient.go:1073 cannot resolve "controller-service.controller-local-microk8s.svc.cluster.local": lookup controller-service.controller-local-microk8s.svc.cluster.local: operation was canceled
2023-11-17T20:40:57.775Z [container-agent] 2023-11-17 20:40:57 INFO juju.api apiclient.go:706 connection established to "wss://10.152.183.27:17070/model/ef974c79-cbe5-4b53-81ca-d316c305a28d/api"
2023-11-17T20:40:57.779Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.apicaller connect.go:163 [ef974c] "unit-mysql-k8s-0" successfully connected to "10.152.183.27:17070"
2023-11-17T20:40:57.801Z [container-agent] 2023-11-17 20:40:57 INFO juju.api apiclient.go:1073 cannot resolve "controller-service.controller-local-microk8s.svc.cluster.local": lookup controller-service.controller-local-microk8s.svc.cluster.local: operation was canceled
2023-11-17T20:40:57.802Z [container-agent] 2023-11-17 20:40:57 INFO juju.api apiclient.go:706 connection established to "wss://10.152.183.27:17070/model/ef974c79-cbe5-4b53-81ca-d316c305a28d/api"
2023-11-17T20:40:57.805Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.apicaller connect.go:163 [ef974c] "unit-mysql-k8s-0" successfully connected to "10.152.183.27:17070"
2023-11-17T20:40:57.829Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.migrationminion worker.go:142 migration phase is now: NONE
2023-11-17T20:40:57.829Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.logger logger.go:120 logger worker started
2023-11-17T20:40:57.834Z [container-agent] 2023-11-17 20:40:57 WARNING juju.worker.proxyupdater proxyupdater.go:241 unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
2023-11-17T20:40:57.843Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.leadership tracker.go:194 mysql-k8s/0 promoted to leadership of mysql-k8s
2023-11-17T20:40:57.849Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.caasupgrader upgrader.go:113 abort check blocked until version event received
2023-11-17T20:40:57.849Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.caasupgrader upgrader.go:119 unblocking abort check
2023-11-17T20:40:57.856Z [container-agent] 2023-11-17 20:40:57 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-mysql-k8s-0
2023-11-17T20:40:57.876Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.uniter uniter.go:363 unit "mysql-k8s/0" started
2023-11-17T20:40:57.879Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.uniter uniter.go:689 resuming charm install
2023-11-17T20:40:57.881Z [container-agent] 2023-11-17 20:40:57 INFO juju.worker.uniter.charm bundles.go:81 downloading ch:amd64/jammy/mysql-k8s-110 from API server
2023-11-17T20:40:57.881Z [container-agent] 2023-11-17 20:40:57 INFO juju.downloader download.go:109 downloading from ch:amd64/jammy/mysql-k8s-110
2023-11-17T20:40:58.063Z [container-agent] 2023-11-17 20:40:58 INFO juju.downloader download.go:92 download complete ("ch:amd64/jammy/mysql-k8s-110")
2023-11-17T20:40:58.187Z [container-agent] 2023-11-17 20:40:58 INFO juju.downloader download.go:172 download verified ("ch:amd64/jammy/mysql-k8s-110")
2023-11-17T20:41:07.684Z [pebble] Check "readiness" failure 1 (threshold 3): received non-20x status code 418
2023-11-17T20:41:08.652Z [container-agent] 2023-11-17 20:41:08 INFO juju.worker.uniter uniter.go:389 hooks are retried true
2023-11-17T20:41:08.740Z [container-agent] 2023-11-17 20:41:08 INFO juju.worker.uniter resolver.go:165 found queued "install" hook
2023-11-17T20:41:10.564Z [container-agent] 2023-11-17 20:41:10 INFO juju-log Running legacy hooks/install.
2023-11-17T20:41:12.196Z [container-agent] 2023-11-17 20:41:12 INFO juju.worker.uniter.operation runhook.go:186 ran "install" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:13.549Z [container-agent] 2023-11-17 20:41:13 INFO juju.worker.uniter.operation runhook.go:186 ran "restart-relation-created" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:14.699Z [container-agent] 2023-11-17 20:41:14 INFO juju.worker.uniter.operation runhook.go:186 ran "upgrade-relation-created" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:15.817Z [container-agent] 2023-11-17 20:41:15 INFO juju.worker.uniter.operation runhook.go:186 ran "database-peers-relation-created" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:15.840Z [container-agent] 2023-11-17 20:41:15 INFO juju.worker.uniter resolver.go:165 found queued "leader-elected" hook
2023-11-17T20:41:17.509Z [container-agent] 2023-11-17 20:41:17 INFO juju.worker.uniter.operation runhook.go:186 ran "leader-elected" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:17.679Z [pebble] Check "readiness" failure 2 (threshold 3): received non-20x status code 418
2023-11-17T20:41:18.961Z [container-agent] 2023-11-17 20:41:18 INFO juju.worker.uniter.operation runhook.go:186 ran "database-storage-attached" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:19.746Z [container-agent] 2023-11-17 20:41:19 ERROR juju-log Uncaught exception while in charm code:
2023-11-17T20:41:19.746Z [container-agent] Traceback (most recent call last):
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 1348, in do_open
2023-11-17T20:41:19.746Z [container-agent]     h.request(req.get_method(), req.selector, req.data, headers,
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1283, in request
2023-11-17T20:41:19.746Z [container-agent]     self._send_request(method, url, body, headers, encode_chunked)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
2023-11-17T20:41:19.746Z [container-agent]     self.endheaders(body, encode_chunked=encode_chunked)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
2023-11-17T20:41:19.746Z [container-agent]     self._send_output(message_body, encode_chunked=encode_chunked)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
2023-11-17T20:41:19.746Z [container-agent]     self.send(msg)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 976, in send
2023-11-17T20:41:19.746Z [container-agent]     self.connect()
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 272, in connect
2023-11-17T20:41:19.746Z [container-agent]     self.sock.connect(self.socket_path)
2023-11-17T20:41:19.746Z [container-agent] FileNotFoundError: [Errno 2] No such file or directory
2023-11-17T20:41:19.746Z [container-agent]
2023-11-17T20:41:19.746Z [container-agent] During handling of the above exception, another exception occurred:
2023-11-17T20:41:19.746Z [container-agent]
2023-11-17T20:41:19.746Z [container-agent] Traceback (most recent call last):
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1599, in _request_raw
2023-11-17T20:41:19.746Z [container-agent]     response = self.opener.open(request, timeout=self.timeout)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 519, in open
2023-11-17T20:41:19.746Z [container-agent]     response = self._open(req, data)
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
2023-11-17T20:41:19.746Z [container-agent]     result = self._call_chain(self.handle_open, protocol, protocol +
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
2023-11-17T20:41:19.746Z [container-agent]     result = func(*args)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 286, in http_open
2023-11-17T20:41:19.746Z [container-agent]     return self.do_open(_UnixSocketConnection, req, # type:ignore
2023-11-17T20:41:19.746Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
2023-11-17T20:41:19.746Z [container-agent]     raise URLError(err)
2023-11-17T20:41:19.746Z [container-agent] urllib.error.URLError:
2023-11-17T20:41:19.746Z [container-agent]
2023-11-17T20:41:19.746Z [container-agent] During handling of the above exception, another exception occurred:
2023-11-17T20:41:19.746Z [container-agent]
2023-11-17T20:41:19.746Z [container-agent] Traceback (most recent call last):
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/./src/charm.py", line 786, in <module>
2023-11-17T20:41:19.746Z [container-agent]     main(MySQLOperatorCharm)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/main.py", line 441, in main
2023-11-17T20:41:19.746Z [container-agent]     _emit_charm_event(charm, dispatcher.event_name)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
2023-11-17T20:41:19.746Z [container-agent]     event_to_emit.emit(*args, **kwargs)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 342, in emit
2023-11-17T20:41:19.746Z [container-agent]     framework._emit(event)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 839, in _emit
2023-11-17T20:41:19.746Z [container-agent]     self._reemit(event_path)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 928, in _reemit
2023-11-17T20:41:19.746Z [container-agent]     custom_handler(event)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/./src/charm.py", line 412, in _on_config_changed
2023-11-17T20:41:19.746Z [container-agent]     if not self._mysql.is_mysqld_running():
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/src/mysql_k8s_helpers.py", line 564, in is_mysqld_running
2023-11-17T20:41:19.746Z [container-agent]     return self.container.exists(MYSQLD_SOCK_FILE)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/model.py", line 2444, in exists
2023-11-17T20:41:19.746Z [container-agent]     self._pebble.list_files(str(path), itself=True)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 2104, in list_files
2023-11-17T20:41:19.746Z [container-agent]     resp = self._request('GET', '/v1/files', query)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1564, in _request
2023-11-17T20:41:19.746Z [container-agent]     response = self._request_raw(method, path, query, headers, data)
2023-11-17T20:41:19.746Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1612, in _request_raw
2023-11-17T20:41:19.746Z [container-agent]     raise ConnectionError(e.reason)
2023-11-17T20:41:19.746Z [container-agent] ops.pebble.ConnectionError: [Errno 2] No such file or directory
2023-11-17T20:41:20.037Z [container-agent] 2023-11-17 20:41:20 ERROR juju.worker.uniter.operation runhook.go:180 hook "config-changed" (via hook dispatching script: dispatch) failed: exit status 1
2023-11-17T20:41:20.039Z [container-agent] 2023-11-17 20:41:20 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2023-11-17T20:41:20.379Z [container-agent] 2023-11-17 20:41:20 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2023-11-17T20:41:21.659Z [container-agent] 2023-11-17 20:41:21 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2023-11-17T20:41:25.050Z [container-agent] 2023-11-17 20:41:25 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2023-11-17T20:41:26.123Z [container-agent] 2023-11-17 20:41:26 INFO juju.worker.uniter.operation runhook.go:186 ran "config-changed" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:26.134Z [container-agent] 2023-11-17 20:41:26 INFO juju.worker.uniter resolver.go:165 found queued "start" hook
2023-11-17T20:41:26.859Z [container-agent] 2023-11-17 20:41:26 INFO juju-log Running legacy hooks/start.
2023-11-17T20:41:27.679Z [pebble] Check "readiness" failure 3 (threshold 3): received non-20x status code 418
2023-11-17T20:41:27.679Z [pebble] Check "readiness" failure threshold 3 hit, triggering action
2023-11-17T20:41:27.996Z [container-agent] 2023-11-17 20:41:27 INFO juju.worker.uniter.operation runhook.go:186 ran "start" hook (via hook dispatching script: dispatch)
2023-11-17T20:41:28.768Z [container-agent] 2023-11-17 20:41:28 INFO juju-log Setting up the logrotate configurations
2023-11-17T20:41:36.436Z [container-agent] 2023-11-17 20:41:36 INFO juju-log Configuring instance
```

GitHub runner test failure for the same issue

ca-scribner added the bug label Nov 17, 2023

@taurus-forever
Contributor

A copy of the log: https://pastebin.canonical.com/p/WxDhkGWwkp/ (the runner will be cleaned up during the holiday).

@RafalSiwek
Contributor

RafalSiwek commented Jan 11, 2024

I stumbled upon this error when running integration tests in my project.

After SSHing into the runners and running kubectl logs mysql-k8s -c charm, I found the cause of the error:

Logs snippet:

2024-01-11T16:48:16.462Z [container-agent] 2024-01-11 16:48:16 DEBUG jujuc server.go:222 running hook tool "juju-log" for mysql-k8s/0-config-changed-7837649463538750024
2024-01-11T16:48:16.462Z [container-agent] 2024-01-11 16:48:16 ERROR juju-log Uncaught exception while in charm code:
2024-01-11T16:48:16.462Z [container-agent] Traceback (most recent call last):
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/./src/charm.py", line 786, in <module>
2024-01-11T16:48:16.462Z [container-agent]     main(MySQLOperatorCharm)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/main.py", line 436, in main
2024-01-11T16:48:16.462Z [container-agent]     _emit_charm_event(charm, dispatcher.event_name)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
2024-01-11T16:48:16.462Z [container-agent]     event_to_emit.emit(*args, **kwargs)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 351, in emit
2024-01-11T16:48:16.462Z [container-agent]     framework._emit(event)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 853, in _emit
2024-01-11T16:48:16.462Z [container-agent]     self._reemit(event_path)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/framework.py", line 942, in _reemit
2024-01-11T16:48:16.462Z [container-agent]     custom_handler(event)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/./src/charm.py", line 412, in _on_config_changed
2024-01-11T16:48:16.462Z [container-agent]     if not self._mysql.is_mysqld_running():
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/src/mysql_k8s_helpers.py", line 564, in is_mysqld_running
2024-01-11T16:48:16.462Z [container-agent]     return self.container.exists(MYSQLD_SOCK_FILE)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/model.py", line 2547, in exists
2024-01-11T16:48:16.462Z [container-agent]     self._pebble.list_files(str(path), itself=True)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 2219, in list_files
2024-01-11T16:48:16.462Z [container-agent]     resp = self._request('GET', '/v1/files', query)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1655, in _request
2024-01-11T16:48:16.462Z [container-agent]     response = self._request_raw(method, path, query, headers, data)
2024-01-11T16:48:16.462Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-k8s-0/charm/venv/ops/pebble.py", line 1704, in _request_raw
2024-01-11T16:48:16.462Z [container-agent]     raise ConnectionError(
2024-01-11T16:48:16.462Z [container-agent] ops.pebble.ConnectionError: Could not connect to Pebble: socket not found at '/charm/containers/mysql/pebble.socket' (container restarted?)
2024-01-11T16:48:16.712Z [container-agent] 2024-01-11 16:48:16 ERROR juju.worker.uniter.operation runhook.go:180 hook "config-changed" (via hook dispatching script: dispatch) failed: exit status 1
2024-01-11T16:48:16.712Z [container-agent] 2024-01-11 16:48:16 DEBUG juju.machinelock machinelock.go:206 created rotating log file "/var/log/machine-lock.log" with max size 10 MB and max backups 5
2024-01-11T16:48:16.713Z [container-agent] 2024-01-11 16:48:16 DEBUG juju.machinelock machinelock.go:190 machine lock "machine-lock" released for mysql-k8s/0 uniter (run config-changed hook)
2024-01-11T16:48:16.713Z [container-agent] 2024-01-11 16:48:16 DEBUG juju.worker.uniter.operation executor.go:118 lock released for mysql-k8s/0
2024-01-11T16:48:16.714Z [container-agent] 2024-01-11 16:48:16 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2024-01-11T16:48:16.714Z [container-agent] 2024-01-11 16:48:16 DEBUG juju.worker.uniter agent.go:22 [AGENT-STATUS] error: hook failed: "config-changed"
2024-01-11T16:48:19.684Z [container-agent] 2024-01-11 16:48:19 DEBUG juju.worker.uniter pebblepoller.go:99 pebble still starting up on container "mysql": socket "/charm/containers/mysql/pebble.socket" not found
2024-01-11T16:48:21.935Z [container-agent] 2024-01-11 16:48:21 DEBUG juju.worker.uniter.remotestate watcher.go:824 workloadEvent enqueued for mysql-k8s/0: 0
2024-01-11T16:48:21.935Z [container-agent] 2024-01-11 16:48:21 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2024-01-11T16:48:23.528Z [container-agent] 2024-01-11 16:48:23 DEBUG juju.worker.uniter.remotestate watcher.go:799 storage attachment change for mysql-k8s/0: {storage-database-0 {2 alive true /var/lib/juju/storage/database/0}}
2024-01-11T16:48:23.528Z [container-agent] 2024-01-11 16:48:23 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2024-01-11T16:48:24.502Z [pebble] Check "readiness" failure 2 (threshold 3): received non-20x status code 418
2024-01-11T16:48:34.506Z [pebble] Check "readiness" failure 3 (threshold 3): received non-20x status code 418
2024-01-11T16:48:34.506Z [pebble] Check "readiness" failure threshold 3 hit, triggering action
2024-01-11T16:48:39.993Z [container-agent] 2024-01-11 16:48:39 DEBUG juju.worker.uniter.remotestate watcher.go:799 storage attachment change for mysql-k8s/0: {storage-database-0 {2 alive true /var/lib/juju/storage/database/0}}
2024-01-11T16:48:39.993Z [container-agent] 2024-01-11 16:48:39 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook
2024-01-11T16:48:44.503Z [pebble] Check "readiness" failure 4 (threshold 3): received non-20x status code 418
2024-01-11T16:48:54.502Z [pebble] Check "readiness" failure 5 (threshold 3): received non-20x status code 418
2024-01-11T16:48:55.003Z [container-agent] 2024-01-11 16:48:55 DEBUG juju.worker.uniter.remotestate watcher.go:799 storage attachment change for mysql-k8s/0: {storage-database-0 {2 alive true /var/lib/juju/storage/database/0}}
2024-01-11T16:48:55.003Z [container-agent] 2024-01-11 16:48:55 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "config-changed" hook

It looked like the config-changed hook failed because the charm could not find the Pebble socket file for the mysql container, and according to the logs, after the hook failed it did not auto-retry and instead awaited error resolution.
The same error was posted in this issue.

When I SSH'd into the mysql-k8s-x pod containers, I was able to find the /charm/containers/mysql/pebble.socket file in the charm container and /charm/container/pebble.socket in the mysql container, and after running juju resolved mysql-k8s/0 the charm went into Active state.
After running the experiments and observing the runner resource metrics, I began to suspect that the limited resources of the runners, and consequently the delayed availability of the sockets, may be the reason for this behaviour.

To keep my integration tests working, as a temporary workaround I implemented a not-so-elegant re-deployment of the charm whenever this error is hit, which solved my issue. Still, I think it would be better to retry the container connection in the charm source itself, so I am testing this.
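For illustration, a minimal sketch of what retrying the Pebble connection could look like (this is not the actual charm code; it assumes the tenacity package is available in the charm venv, and the MYSQLD_SOCK_FILE value is illustrative):

```python
# Hypothetical sketch: retry the Pebble call that fails in the traceback above
# instead of letting ops.pebble.ConnectionError fail the hook immediately.
from ops import pebble
from tenacity import retry, retry_if_exception_type, stop_after_delay, wait_fixed

MYSQLD_SOCK_FILE = "/var/run/mysqld/mysqld.sock"  # illustrative value


@retry(
    retry=retry_if_exception_type(pebble.ConnectionError),
    stop=stop_after_delay(60),  # give Pebble up to a minute to come back
    wait=wait_fixed(5),
    reraise=True,
)
def is_mysqld_running(container) -> bool:
    # container.exists() raises pebble.ConnectionError while the Pebble socket
    # is missing; the decorator retries until it reappears or the delay expires.
    return container.exists(MYSQLD_SOCK_FILE)
```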

@paulomach
Contributor

paulomach commented Jan 15, 2024

Hi @RafalSiwek, what you are seeing is a symptom of the mysql container being OOMKilled, hence the Pebble connection being lost.

I was checking your tests and saw that your bundle does not set any options for the mysql-k8s charm.

The default behavior for memory allocation is somewhat grabby, which for testing on GH runners will most probably lead to an unstable outcome.

So, my advice is either to have a dedicated test_bundle.yaml or to set up the bundle configuration in the test, adding a value for the profile option of the mysql-k8s charm, e.g.:

@@ -8,6 +8,8 @@
     scale: 1
     trust: true
     _github_repo_name: mysql-k8s-operator
+    options:
+      profile: testing
   zenml-server:
     charm: zenml-server
     channel: latest/edge

The testing profile is very lightweight, tuned for this use case.
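For reference, the same option can also be set when deploying from the integration test itself rather than a bundle (a sketch using pytest-operator's ops_test fixture; the channel, application name, and timeout are assumptions):

```python
# Sketch: setting the mysql-k8s "profile" option from a pytest-operator
# integration test instead of a bundle.
import pytest


@pytest.mark.abort_on_fail
async def test_deploy_mysql(ops_test):
    await ops_test.model.deploy(
        "mysql-k8s",
        application_name="mysql-k8s",
        channel="8.0/edge",
        trust=True,
        config={"profile": "testing"},  # lightweight profile for CI runners
    )
    await ops_test.model.wait_for_idle(
        apps=["mysql-k8s"], status="active", timeout=1800
    )
```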

@paulomach
Contributor

And by the way, it's encouraged to set limits, using either k8s constraints or profiles, in production deployments.

@RafalSiwek
Contributor

Hi @paulomach, thanks for the suggestions!

As the issue persisted, I decided to conduct more tests, including setting the profile to "testing". Unfortunately, the problem remained (you can review my testing path): the mysql container was running without any recorded kills or restarts, while the charm was stuck in the config-changed hook error state, awaiting manual intervention.

Now, with evidence suggesting that the Pebble socket connection error is causing the charm to enter this hook error limbo state when calling the is_mysqld_running method, I'm considering whether incorporating error handling, timeouts, or a retry policy—especially for the testing profile—might be beneficial.

@paulomach
Contributor

paulomach commented Jan 16, 2024

I believe this is due to a config-changed hook being handled in between the leader-elected and pebble-ready events. I've proposed a change on @RafalSiwek's PR #356 to address that.
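One common way to handle that ordering in ops-framework charms (a generic sketch, not necessarily the exact change made in the PR) is to guard the handler until Pebble in the workload container is reachable:

```python
# Sketch: guard config-changed until the Pebble socket is available, instead of
# letting a ConnectionError escape and fail the hook. Names are illustrative.
import ops


class MySQLOperatorCharm(ops.CharmBase):
    def _on_config_changed(self, event: ops.EventBase) -> None:
        container = self.unit.get_container("mysql")
        if not container.can_connect():
            # Pebble has not started yet (e.g. config-changed arrived between
            # leader-elected and mysql-pebble-ready); try again later.
            event.defer()
            return
        # ... safe to talk to Pebble from here on ...
```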

@orfeas-k

orfeas-k commented Jan 17, 2024

Note that it keeps failing consistently for us (the Kubeflow team) as well. You can see that all 3 integration tests that deploy mysql-k8s have failed with the same error, even though we have constraints set.

@paulomach
Contributor

> Note that it keeps failing consistently for us (the Kubeflow team) as well. You can see that all 3 integration tests that deploy mysql-k8s have failed with the same error, even though we have constraints set.

Hi @orfeas-k, I believe we have got to the root cause now, and tests are running for the PR.

@orfeas-k

Great, thanks for the update. Let us know when the fix is published!

@paulomach
Contributor

@orfeas-k @ca-scribner @RafalSiwek the fix is published on 8.0/edge (rev119).

Please give it a try to confirm, and close this issue.

@RafalSiwek
Contributor

I tested the update on 4 concurrent runs (#1, #2, #3, #4), and all of them succeeded on the first try. It seems that this change has resolved the issue on my end.

@orfeas-k

Tested and works! Thank you for implementing this. It would be great if you could let us know when this fix is published to stable so we can revert the changes we implemented.

P.S. @paulomach I don't seem to have permission to close this issue.

@paulomach
Contributor

> Tested and works! Thank you for implementing this. It would be great if you could let us know when this fix is published to stable so we can revert the changes we implemented.
>
> P.S. @paulomach I don't seem to have permission to close this issue.

Thank you. We will let you know when it is promoted to stable.

rgildein added a commit to canonical/kfp-operators that referenced this issue Jun 12, 2024
Since the issue [1] was merged and those changes are already in 8.0/stable, we can use that channel again.

fixes: #401

---
[1]: canonical/mysql-k8s-operator#337