We have had reports that, when LXD is load-balanced via DNS SRV records, requests can be inconsistent and sometimes return Error: Not authorized. We found that the cause of this issue was that one cluster member did not have an up-to-date identity cache.
The synchronisation of the identity cache is very important because it is the means by which callers authenticate with the server, and contains the projects or groups that the caller has access to/is a member of. If, for example, a project is removed from the list of projects associated with a restricted client certificate, and one cluster member fails to update their cache, then requests to that member will continue to grant access to the project.
The current approach
Currently, all API handlers that modify the cache have the following steps:
1. Authentication (except in cases where a token is sent to authenticate a new client).
2. Authorization.
3. Validation (some validation may have already occurred in the previous step).
4. Database change.
5. Internal notification to other cluster members via POST /internal/identity-cache-refresh. Only members that are "Alive" are sent the notification.
6. Local cache update.
7. Send lifecycle event.
8. Return.
Currently, step 5 returns an error only if a request fails to send to another member. The handler for POST /internal/identity-cache-refresh does not return an error; it only logs failures. Additionally, by only notifying "alive" members, we skip members whose heartbeat is lagging (the cut-off for this is configurable via cluster.offline_threshold) and members that do not currently have connectivity with the initiating member (checked via TLS handshake).
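For illustration, here is a rough sketch of how step 5 and the internal handler behave today. All names (notifyAliveMembers, identityCacheStub, and so on) are stand-ins, not the actual LXD symbols:

```go
package sketch

import (
	"context"
	"fmt"
	"log"
	"net/http"
)

// identityCacheStub stands in for LXD's in-memory identity cache.
type identityCacheStub struct{}

// Refresh would reload identities from the database; stubbed here.
func (c *identityCacheStub) Refresh(ctx context.Context) error { return nil }

var identityCache = &identityCacheStub{}

// notifyAliveMembers mirrors step 5: only "alive" members are contacted, and an
// error is returned only when the request itself fails to send.
func notifyAliveMembers(ctx context.Context, aliveMembers []string) error {
	for _, addr := range aliveMembers {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, "https://"+addr+"/internal/identity-cache-refresh", nil)
		if err != nil {
			return err
		}

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return fmt.Errorf("Failed to notify %q: %w", addr, err)
		}

		_ = resp.Body.Close()
	}

	return nil
}

// identityCacheRefreshHandler mirrors the receiving side: a failed refresh is
// only logged, so the initiator never learns about it.
func identityCacheRefreshHandler(w http.ResponseWriter, r *http.Request) {
	err := identityCache.Refresh(r.Context())
	if err != nil {
		log.Printf("Failed to refresh identity cache: %v", err)
	}

	w.WriteHeader(http.StatusOK)
}
```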
Note that the identity cache is also updated on DQLite heartbeat:
On a restarted LXD cluster member (on initialisation).
When a new cluster member is added (member count change).
When any member's address or online status changes.
The current approach was thought to be robust enough. Notifying live members should be sufficient because those that are not live will execute an identity cache refresh on heartbeat. However, there are clearly some issues.
Points of failure
1. Any node (including the initiator) may silently fail to update the identity cache. Since the cache update notification is sent concurrently, all members will attempt to read from the database at the same time. For large deployments, and because all DQLite queries are sent to the leader, this can easily result in context timeouts.
2. Failure to send the notification to one or more members. In this case, the notification will cause the handler to error before the cache is updated on the initiator. If the notification fails to send to all members, the cache will be inconsistent with the database on all members. If the notification succeeds on some members, the cache will be valid on those members, but not on the initiator.
Possible solutions
These are just a few ideas I have. Please feel free to add more!
More rigorous notification strategy
We could change the notification policy when calling POST /internal/identity-cache-refresh to "NotifyAll". In this case, the notifier will bail early if not all members are online (before sending the notification). Additionally, we can return an error from the internal endpoint if the cache fails to update. With this change, the steps for API endpoints that modify the cache would be as follows:
1. Authentication (except in cases where a token is sent to authenticate a new client).
2. Authorization.
3. Validation (some validation may have already occurred in the previous step).
4. Database change.
5. Internal notification to other cluster members via POST /internal/identity-cache-refresh. All members must be sent the notification.
   i. On error, revert database changes and return an error to the caller.
6. Local cache update.
7. Send lifecycle event.
8. Return.
This approach is likely an improvement on the current situation, as it would bail early if not all members are considered "alive". However, the problem remains that some members may successfully update their identity cache while others do not. For example, the initiator will return an error to the caller if the cache fails to update on one member, but other members may have already refreshed successfully using the contents of the database before the change was reverted.
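For concreteness, here is a minimal sketch of the NotifyAll flow just described. The callbacks (isOnline, dbChange, notify) are hypothetical stand-ins for the real notifier and database layers; the final comment marks where the partial-refresh inconsistency creeps in:

```go
package sketch

import (
	"context"
	"fmt"
)

// applyIdentityChange sketches the "NotifyAll" variant with hypothetical
// callbacks: dbChange applies the change and returns a revert hook, notify
// calls POST /internal/identity-cache-refresh on one member, and isOnline
// reports whether a member is reachable.
func applyIdentityChange(
	ctx context.Context,
	members []string,
	isOnline func(member string) bool,
	dbChange func(ctx context.Context) (revert func() error, err error),
	notify func(ctx context.Context, member string) error,
) error {
	// Bail early if any member is offline (NotifyAll policy).
	for _, m := range members {
		if !isOnline(m) {
			return fmt.Errorf("Cluster member %q is offline, refusing to modify identities", m)
		}
	}

	// Step 4: database change, keeping a revert hook.
	revert, err := dbChange(ctx)
	if err != nil {
		return err
	}

	// Step 5: notify all members; the internal endpoint now returns an error
	// if its local cache refresh fails.
	for _, m := range members {
		err := notify(ctx, m)
		if err != nil {
			// Step 5.i: revert the database change and return the error.
			// Note: members notified before this point have already refreshed
			// their cache from the pre-revert database state.
			_ = revert()
			return fmt.Errorf("Identity cache refresh failed on %q: %w", m, err)
		}
	}

	return nil
}
```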
Update cache on next request
We could change the behaviour of POST /internal/identity-cache-refresh to set a flag (e.g. doUpdateCache) on the other members, such that the identity cache is not updated immediately but on the next request. This would allow us to change the flow of API endpoints that modify the cache:
1. Authentication (except in cases where a token is sent to authenticate a new client).
   i. As part of authentication, check the doUpdateCache flag and update the cache if required. If the cache fails to update, return an error to the caller. If the cache updates successfully, unset the doUpdateCache flag and continue.
2. Authorization.
3. Validation (some validation may have already occurred in the previous step).
4. Set the cache update flag locally.
5. Internal notification to other cluster members via POST /internal/identity-cache-refresh. All members must be sent the notification.
   i. On error, return an error to the caller.
6. Database change.
7. Send lifecycle event.
8. Local cache update.
   i. On error, return an error to the caller indicating that the change was made successfully, but the member was not able to update its cache.
   ii. On success, unset the doUpdateCache flag.
9. Return.
10. In a deferred function, make a request to all other cluster members. This will cause them to refresh their identity cache so that the cost is not incurred by the next caller.
This approach is advantageous because it allows handling cluster notification errors before making any changes to the database. If the notification fails, it fails before any database changes are made. If the database change fails, the other members will still refresh their cache, but it will not have changed (i.e. no auth concerns). If the cache in the initiator fails to update, it will update before authenticating the next request.
However, there is a race condition here. If a request is sent concurrently to another member after the doUpdateCache flag is set (via notification), but before the database change is committed, then the other member's cache will have incorrect data and will not know to perform an update.
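A minimal sketch of the flag mechanism, assuming a process-wide atomic.Bool named doUpdateCache (all names here are hypothetical). The race described above is the window between a remote member storing true and the initiator committing its database change:

```go
package sketch

import (
	"context"
	"net/http"
	"sync/atomic"
)

// doUpdateCache is the hypothetical per-member flag: the internal endpoint only
// sets it, and the actual refresh happens lazily during the next authentication.
var doUpdateCache atomic.Bool

// refreshIdentityCache stands in for reloading identities from the database.
func refreshIdentityCache(ctx context.Context) error { return nil }

// identityCacheRefreshHandler now only marks the cache as stale; it cannot fail.
func identityCacheRefreshHandler(w http.ResponseWriter, r *http.Request) {
	doUpdateCache.Store(true)
	w.WriteHeader(http.StatusOK)
}

// authenticate sketches step 1.i: refresh the cache first if it was flagged as
// stale, and only clear the flag once the refresh has succeeded.
func authenticate(ctx context.Context) error {
	if doUpdateCache.Load() {
		err := refreshIdentityCache(ctx)
		if err != nil {
			return err // The caller gets an error rather than a stale answer.
		}

		doUpdateCache.Store(false)
	}

	// ... continue with normal TLS/OIDC authentication against the cache ...
	return nil
}
```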
Enforce cache consistency via the database
The identity cache can hold a hash of its contents. If this hash is written to the database (location TBD), we can verify the hash before authenticating requests. The flow for API endpoints would look like this:
1. Authentication (except in cases where a token is sent to authenticate a new client).
   i. As part of authentication, check that the hash of the identity cache matches what is in the database and update the cache if required. If the cache fails to update, return an error to the caller. If the cache updates successfully, set the new hash in the cache.
2. Authorization.
3. Validation (some validation may have already occurred in the previous step).
4. Database change.
5. Send lifecycle event.
6. Local cache update.
   i. On error, return an error to the caller indicating that the change was made successfully, but the member was not able to update its cache.
   ii. On success, set the new hash in the cache.
7. Return.
8. In a deferred function, make a request to all other cluster members. This will cause them to refresh their identity cache so that the cost is not incurred by the next caller.
This approach is advantageous because no cluster notifications are required (they can still be used optionally, to avoid incurring the cost on the next caller). However, it requires an extra database call for every API request.
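As a sketch of what step 1.i might look like, assuming a hash stored alongside the identity data (hashedIdentityCache and loadCacheHashFromDB are hypothetical, since the storage location is TBD):

```go
package sketch

import (
	"context"
	"fmt"
)

// hashedIdentityCache is a hypothetical cache that remembers the hash of the
// data it was last populated with.
type hashedIdentityCache struct {
	hash string
}

func (c *hashedIdentityCache) Hash() string { return c.hash }

// Refresh reloads identities from the database and recomputes the hash; stubbed here.
func (c *hashedIdentityCache) Refresh(ctx context.Context) (newHash string, err error) {
	return "", nil
}

// checkCacheFreshness sketches step 1.i: compare the cached hash with the one
// in the database and refresh before authenticating if they differ. This is
// the extra per-request database call mentioned above.
func checkCacheFreshness(ctx context.Context, cache *hashedIdentityCache, loadCacheHashFromDB func(ctx context.Context) (string, error)) error {
	dbHash, err := loadCacheHashFromDB(ctx)
	if err != nil {
		return err
	}

	if dbHash == cache.Hash() {
		return nil // Cache is consistent with the database.
	}

	newHash, err := cache.Refresh(ctx)
	if err != nil {
		return fmt.Errorf("Identity cache is stale and failed to refresh: %w", err)
	}

	cache.hash = newHash
	return nil
}
```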
Conclusions
Of the three approaches, I think #3 is the safest and simplest, but it will increase latency across the whole API (especially on non-leader members). I'd also be happy to go with #2, but we would need to consider how it works with concurrent callers (e.g. something like Terraform), especially when LXD is running behind a load balancer or round-robin DNS.
For implementation, in both cases I think it will be best to move more logic into the cache itself to reduce duplication. We could have it function like a write-through cache, performing the database calls as well as the notification logic.
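As a very rough illustration of that idea, a hypothetical write-through interface might look like this (nothing here reflects existing LXD types):

```go
package sketch

import "context"

// identityWriteThroughCache sketches the interface hinted at above: handlers
// call the cache, and the cache performs the database write, any cluster
// notification, and the local refresh in one place. All names are hypothetical.
type identityWriteThroughCache interface {
	// AddIdentity writes the identity to the database and keeps the cache
	// (and, depending on the chosen approach, the stored hash or the other
	// members' flags) consistent, reverting on failure.
	AddIdentity(ctx context.Context, identity Identity) error

	// DeleteIdentity does the same for removals.
	DeleteIdentity(ctx context.Context, identifier string) error

	// Get reads from the in-memory cache only.
	Get(ctx context.Context, identifier string) (Identity, error)
}

// Identity is a placeholder for the cached identity record.
type Identity struct {
	Identifier string
	Groups     []string
	Projects   []string
}
```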
Happy to hear of any other solutions/ideas/thoughts on this. Thanks.