
openvpn de-escalates privileges, which causes a hard failure on reconnect to a different endpoint #1779

Open
xginn8 opened this issue Dec 18, 2019 · 19 comments

@xginn8
Contributor

xginn8 commented Dec 18, 2019

In 12205bf we configured openvpn to de-escalate privileges and run as the openvpn user & group.

As noted in the OpenVPN docs (https://openvpn.net/community-resources/reference-manual-for-openvpn-2-0/), this privilege drop can cause a hard failure:

Note the following corner case: If you use multiple --remote options, AND you are dropping root privileges on the client with --user and/or --group, AND the client is running a non-Windows OS, if the client needs to switch to a different server, and that server pushes back different TUN/TAP or route settings, the client may lack the necessary privileges to close and reopen the TUN/TAP interface. This could cause the client to exit with a fatal error.

The daemon crashes with the following error after the OpenVPN server pushes it a different address:

Dec 17 23:15:15 XXXXXX openvpn[946]: Tue Dec 17 23:15:15 2019 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)

This issue is related to #1776: after the ping-restart described in that ticket, the remote pushes a new address, which the daemon cannot apply.

cc @wrboyce @afitzek

@balena-ci
Contributor

[xginn8] This issue has attached support thread https://jel.ly.fish/#/support-thread~b16dd5c1-4ef9-4972-8dac-ce50ea9dcaf6

@jellyfish-bot

[saintaardvark] This issue has attached support thread https://jel.ly.fish/38ab2df4-55c2-4171-8ac0-485ba067438d

@markcorbinuk
Contributor

This issue occurred overnight on my RPi 3; strangely enough, it was at 41 minutes past the hour, as previously reported. Log file: openvpn-unit-journal.txt

I don't know why the network is dropping, but I managed to reproduce the OpenVPN behaviour by temporarily blocking incoming traffic from port 443 with an iptables rule.

@jellyfish-bot

[xginn8] This issue has attached support thread https://jel.ly.fish/50a38ebb-7b5d-4708-ae46-5230910ec13e

@alexgg
Contributor

alexgg commented Mar 15, 2021

This should have been fixed by the merge of #2014 in v2.60. Please re-open only if the same problem is reported on a version above that.

@alexgg alexgg closed this as completed Mar 15, 2021
@markcorbinuk
Contributor

The merge of #2014 only fixes this for instantaneous reboots, where the outage lasts less than 60 seconds.

@markcorbinuk markcorbinuk reopened this Mar 18, 2021
@jellyfish-bot

[majorz] This issue has attached support thread https://jel.ly.fish/1b57a2f7-e2b2-4658-94ef-0a35bef04f4b

@jellyfish-bot

[majorz] This issue has attached support thread https://jel.ly.fish/78547810-74ac-4ae3-b854-60727ac077c0

@majorz
Contributor

majorz commented Jan 27, 2022

I investigated a couple of instances of the ioctl TUNSETIFF error. It happens quite frequently when the VPN connection between the device and our servers drops and the OpenVPN client tries to reconnect. When the client succeeds in re-establishing the connection to our servers, the error occurs, the client restarts itself, and on the next attempt everything is fine. So it is not a severe issue, but it still has to be solved, since that restart should not happen.

@20k-ultra
Contributor

20k-ultra commented Apr 6, 2022

Found an RPi 4 running host OS version balenaOS 2.95.8 that exhibits this issue:

Apr 06 19:50:50 xxxxxxx openvpn[1806517]: Wed Apr  6 19:50:50 2022 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Apr 06 19:50:50 xxxxxxx openvpn[1806517]: Wed Apr  6 19:50:50 2022 Exiting due to fatal error
Apr 06 19:50:50 xxxxxxx systemd[1]: openvpn.service: Main process exited, code=exited, status=1/FAILURE
Apr 06 19:50:50 xxxxxxx systemd[1]: openvpn.service: Failed with result 'exit-code'.

Restarting the VPN service does not resolve this issue.

@jellyfish-bot

[rhampt] This issue has attached support thread https://jel.ly.fish/c550cc88-af96-4e61-b5e8-a81dd0f47f07

@majorz
Contributor

majorz commented Jan 17, 2023

Note that this issue is probably causing balena-io/open-balena-vpn#313, so it is more severe than initially thought.

OpenVPN starts as the root user. After it initializes and connects to the server, it drops privileges and runs as the openvpn user.

If for some reason the VPN connection goes stale and no server ping is received for 60 seconds, the client pauses for 5 seconds and then tries to re-establish the connection to the server:

[vpn.balena-cloud.com] Inactivity timeout (--ping-restart), restarting
TCP/UDP: Closing socket
SIGUSR1[soft,ping-restart] received, process restarting
Restart pause, 5 second(s)
Re-using SSL/TLS context
...

Because the process is not actually restarted and does not exit, it fails to recreate the tun interface, since it is no longer running as root at that point:

...
PUSH: Received control message: 'PUSH_REPLY,sndbuf 0,rcvbuf 0,route 52.4.252.97,ping 10,ping-restart 60,socket-flags TCP_NODELAY,ifconfig 10.242.26.155 52.4.252.97,peer-id 0,cipher AES-128-GCM'
...
NOTE: Pulled options changed on restart, will need to close and reopen TUN/TAP device.
...
Closing TUN/TAP interface
...
ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Exiting due to fatal error
Main process exited, code=exited, status=1/FAILURE
openvpn.service: Failed with result 'exit-code'.
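A rough way to spot this failure mode in an exported unit journal is to pair each SIGUSR1 soft restart with a TUNSETIFF permission error that follows it. The Python sketch below is a heuristic under stated assumptions, not a balena or OpenVPN tool: the function name find_failed_soft_restarts and both regexes are hypothetical, and they assume plain-text lines formatted like the excerpts in this thread. It catches both the ping-restart and the connection-reset triggers.

```python
import re

# Hypothetical patterns, written against the journal excerpts quoted in this issue.
SOFT_RESTART = re.compile(
    r"SIGUSR1\[soft,(?P<reason>[^\]]+)\] received, process restarting"
)
TUN_EPERM = re.compile(
    r"Cannot ioctl TUNSETIFF (?P<dev>\S+): Operation not permitted \(errno=1\)"
)


def find_failed_soft_restarts(lines):
    """Return (restart reason, tun device) for every TUNSETIFF EPERM error
    that follows a SIGUSR1 soft restart in the given log lines."""
    pending_reason = None  # reason of the most recent unmatched soft restart
    hits = []
    for line in lines:
        restart = SOFT_RESTART.search(line)
        if restart:
            pending_reason = restart.group("reason")
            continue
        error = TUN_EPERM.search(line)
        if error and pending_reason is not None:
            hits.append((pending_reason, error.group("dev")))
            pending_reason = None
    return hits


if __name__ == "__main__":
    sample = [
        "Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar  7 09:41:49 2023 "
        "SIGUSR1[soft,connection-reset] received, process restarting",
        "Mar 07 09:41:58 f450cbf openvpn[6998]: Tue Mar  7 09:41:58 2023 "
        "ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)",
    ]
    print(find_failed_soft_restarts(sample))
```

Run against the two key lines of the Mar 7 excerpt, it prints [('connection-reset', 'resin-vpn')]. A "SIGTERM[soft,ping-restart] received, process exiting" line is deliberately not matched, because in that case the process exits instead of restarting in place.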

The --ping-restart option is pushed by the server to the client (see the PUSH_REPLY line above). If the server pushes ping-exit instead, the process simply terminates and is restarted by systemd. The new process starts as root, so it does not run into the same problem.

If such a change cannot currently be made on the server side, the alternative is to remap the SIGUSR1 signal to SIGTERM by passing --remap-usr1 SIGTERM to the client arguments in openvpn.service. In that case the process exits instead of trying to re-establish the connection:

[vpn.balena-cloud.com] Inactivity timeout (--ping-restart), restarting
...
Closing TUN/TAP interface
...
SIGTERM[soft,ping-restart] received, process exiting
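The difference between the two recovery paths can be illustrated with a small Python toy model. It is hypothetical and does not call OpenVPN: it assumes, as in the logs above, that the pulled options changed so the tun device must be reopened, and that reopening it requires root. The names ToyClient, recover_after_stale_link and the uid value 1000 are made up for illustration; exit_instead_of_restart=True stands for either the server pushing ping-exit or the client running with --remap-usr1 SIGTERM, after which systemd starts a fresh process as root.

```python
from dataclasses import dataclass

ROOT_UID = 0
OPENVPN_UID = 1000  # illustrative uid for the unprivileged openvpn user


@dataclass
class ToyClient:
    """Hypothetical stand-in for one openvpn process (not real OpenVPN code)."""
    uid: int = ROOT_UID  # systemd starts the process as root
    tun_open: bool = False

    def open_tun(self) -> None:
        # Modelled assumption: (re)creating the tun device needs root.
        if self.uid != ROOT_UID:
            raise PermissionError("Cannot ioctl TUNSETIFF: Operation not permitted")
        self.tun_open = True

    def connect(self) -> None:
        self.open_tun()
        self.uid = OPENVPN_UID  # like --user/--group: drop privileges after setup


def recover_after_stale_link(exit_instead_of_restart: bool) -> str:
    client = ToyClient()
    client.connect()
    # Timeout or reset: pulled options changed, so the tun device is closed.
    client.tun_open = False
    if exit_instead_of_restart:
        # ping-exit or --remap-usr1 SIGTERM: the process exits and systemd
        # starts a fresh one, which runs as root again.
        client = ToyClient()
    # A SIGUSR1 soft restart keeps the same, already unprivileged process.
    try:
        client.connect()
    except PermissionError:
        return "fatal error"
    return "reconnected"


print(recover_after_stale_link(False))  # fatal error
print(recover_after_stale_link(True))   # reconnected
```

The model deliberately ignores the case where the pulled options are unchanged and the tun device does not need to be reopened, which is why the failure only shows up after a "Pulled options changed on restart" notice.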

If the change is made on the server side, currently deployed devices will no longer incorrectly report heartbeat-only mode. If it is made on the OS side, the problem will be solved only for devices running a newer OS version.

Other methods for addressing this problem also exist (https://community.openvpn.net/openvpn/wiki/UnprivilegedUser), but they would require much more substantial changes on both the client and server side, including adjusting openvpn.conf on the client. Since openvpn.conf is currently retrieved online by os-config, this would be even more difficult, as we have to preserve backwards compatibility with it.

@jellyfish-bot

[thgreasi] This has attached https://jel.ly.fish/e7abfa7a-59e7-4326-ae22-1d5c77ef7348

@klutchell
Collaborator

Resolved by balena-io/open-balena-vpn#314

open-balena-vpn v11.19.0 is now in balenaCloud production

@majorz majorz reopened this Mar 2, 2023
@majorz
Contributor

majorz commented Mar 2, 2023

We have seen a new instance of this error. This one is not as severe, but this time we will have to fix it on the OS side.

The VPN connection was reset for some unknown reason (previously we did not receive ping messages from the server).

Mar 01 14:39:37 b65a222 openvpn[2640]: Wed Mar  1 14:39:37 2023 Connection reset, restarting [0]
Mar 01 14:39:37 b65a222 openvpn[2640]: Wed Mar  1 14:39:37 2023 /etc/openvpn-misc/downscript.sh resin-vpn 1500 1555 10.246.107.185 52.4.252.97 restart
Mar 01 14:39:37 b65a222 openvpn[2640]: Wed Mar  1 14:39:37 2023 SIGUSR1[soft,connection-reset] received, process restarting

The fix for this scenario was explained in an earlier comment: remap the SIGUSR1 signal to SIGTERM by passing --remap-usr1 SIGTERM to the client arguments in openvpn.service.

@majorz
Contributor

majorz commented Mar 7, 2023

Encountered another instance of this, but this time it led to VPN unavailability, so I will prioritize fixing it:

Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar  7 09:41:49 2023 Connection reset, restarting [0]
Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar  7 09:41:49 2023 /etc/openvpn-misc/downscript.sh resin-vpn 1500 1555 10.241.127.118 52.4.252.97 restart
Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar  7 09:41:49 2023 SIGUSR1[soft,connection-reset] received, process restarting
Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar  7 09:41:49 2023 Restart pause, 5 second(s)
...
Mar 07 09:41:58 f450cbf openvpn[6998]: Tue Mar  7 09:41:58 2023 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Mar 07 09:41:58 f450cbf openvpn[6998]: Tue Mar  7 09:41:58 2023 Exiting due to fatal error
Mar 07 09:41:58 f450cbf systemd[1]: openvpn.service: Main process exited, code=exited, status=1/FAILURE
Mar 07 09:41:58 f450cbf systemd[1]: openvpn.service: Failed with result 'exit-code'.

@jellyfish-bot

[majorz] This has attached https://jel.ly.fish/3b115ca6-f2ab-4ffc-a11c-54811547eb15

@majorz
Contributor

majorz commented Mar 7, 2023

We may try pushing "remap-usr1 SIGTERM" from the server side, similar to how we handled ping-exit.

@majorz
Contributor

majorz commented Aug 11, 2023

Remapping this signal may have side effects, so it may or may not be a good solution. Probably not.

10 participants