Privateclusterfile #2204
base: main
Conversation
This lets us avoid opening many connections, each using its own cluster file. If each instance has its private cluster file, they can run out of sync when the cluster changes. If they share a cluster file, there can be races as multiple threads write to the same file.
Instead of reading the updated connection string from fdbcli's cluster file, ask the cluster. This is meant to make it less likely to have races in case the file is unexpectedly being written to.
Thought about it a bit more, here's an idea for how to let the connection string flow cleanly. General idea: the coordinators decide on the cluster file content; the authoritative data is in the coordinated state. The operator can suggest a change by running the "coordinators" command, so the operator should just distribute it. There is already proven code in the FDB client that keeps a cluster file in sync with the coordinated state. Currently, the operator overwrites the cluster file frequently. Proposal: when we see a cluster UID for the first time, we write its seed connection string to a file, call OpenDatabase() on it immediately, and never touch the file again. The FDB client library will keep it up to date. To run fdbcli, we make a temporary copy of the cluster file (FDB updates it atomically in the file system) and delete the copy when done. We watch the authoritative connection string and publish it as the seed connection string to the clients like today. I'll think about it a bit more and maybe update the PR next week (or make a second one for discussion); I think this is cleaner.
First of all, thanks for taking the time to work on this! We have plans to cut a new operator release and drop support for older FDB releases; that will help us because we'll be able to make use of the mgmt API, and then a lot of the fdbcli interactions will go away.
For some historical background: we initially started with random cluster files, but there were some issues in previous versions of the Go bindings where the tracked clusters were not deleted when the DB was closed, which caused a memory leak. So we have to ensure we are closing the DB properly.
if err != nil {
	return fdb.Database{}, err
}
database, err := fdb.OpenDatabase(clusterFile)
You could use OpenWithConnectionString to bypass the other cache on the Go SDK side; this was briefly discussed with @johscheuer here
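For reference, a minimal sketch of that suggestion, assuming a version of the FoundationDB Go bindings that exposes `OpenWithConnectionString` (the API version and connection string below are placeholder values, not taken from the operator):

```go
package main

import (
	"log"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
	// Assumed API version for illustration.
	fdb.MustAPIVersion(710)

	// Placeholder seed connection string; in the operator this would be the
	// connection string already held in memory for the cluster.
	connStr := "docker:docker@127.0.0.1:4500"

	// Opens a database directly from the connection string, so no cluster
	// file exists on disk and the library's file-name-keyed cache is bypassed.
	db, err := fdb.OpenWithConnectionString(connStr)
	if err != nil {
		log.Fatal(err)
	}
	_ = db
}
```

Note this sidesteps the file races entirely for the library's own connections, but fdbcli would still need a file on disk.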
Description
FoundationDB Operator creates cluster files in the temp directory so that fdbcli and the FoundationDB client library can connect to the cluster. The cluster file name is deterministically derived from the cluster UID.
Three different actors write to and read from this file without synchronization:
While the cluster controller is single-threaded, the library may be writing in the background. Since the introduction of the backup and restore controllers, there can be multiple active instances of the admin client at the same time. The cluster controller holds an instance of the admin client while the subreconcilers create other instances, though I think it is inactive during reconciliation, so this last point is not an actual problem.
If a race occurs (e.g. the reader finds the cluster file in a transient state where the file is truncated or temporarily overwritten with the now-invalid seed connection string, or a writer overwrites a legitimately newer file), things can go wrong. For example, when changing coordinators, we read back the file after fdbcli exits, assuming it contains the new connection string.
To prevent this class of problems, the proposed change creates a unique cluster file for each instance of the admin client and deletes it when done. It instantiates a singleton database instance with its private cluster file for each cluster UID and keeps it around. To be extra safe, the new connection string is read from the cluster instead of from the cluster file after changing coordinators, to get an authoritative result.
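The per-UID singleton could look roughly like this. This is an illustrative sketch only: the generic cache stands in for the operator's real database handle type, and all names (`databaseCache`, `get`, `openFn`) are made up for the example. The essential property is that the open function runs at most once per cluster UID.

```go
package main

import (
	"sync"
)

// databaseCache hands out one shared handle per cluster UID, so repeated
// admin-client constructions don't pile up database objects in the FDB
// client library's internal cache.
type databaseCache[T any] struct {
	mu   sync.Mutex
	open map[string]T
}

func newDatabaseCache[T any]() *databaseCache[T] {
	return &databaseCache[T]{open: map[string]T{}}
}

// get returns the cached handle for uid, calling openFn only on first use.
// The mutex serializes concurrent callers (e.g. subreconcilers and the
// backup/restore controllers) so openFn never runs twice for the same UID.
func (c *databaseCache[T]) get(uid string, openFn func() (T, error)) (T, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if db, ok := c.open[uid]; ok {
		return db, nil
	}
	db, err := openFn()
	if err != nil {
		var zero T
		return zero, err
	}
	c.open[uid] = db
	return db, nil
}
```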
Type of change
Discussion
We experienced an incident where the operator, and subsequently all clients, lost contact with the cluster, and I believe this was due to such a race. We have seen connection strings running out of sync before in a less catastrophic way. Our setup is a bit unusual, and maybe that exposes us more than others: our cluster runs in three_data_hall mode, which means one "cluster"'s coordinators occasionally change unexpectedly because we change coordinators in another data hall; the three "clusters" are a single FDB cluster with a shared coordinator set, after all. And we are running coordinators on dedicated servers, using a custom class, so the servers actually shut down after losing the coordinator role.
The singleton database may be a bit confusing, because the FDB client library internally keeps a singleton map like this too, keyed by cluster file name. We need our own because with unique cluster file names, we'd create more and more databases in the library's cache that never get cleaned up, leading to a memory leak. Alternative approaches could be to keep a single cluster file for the library and only use temporary files for fdbcli invocations, or to ask the cluster (via the library) for the current connection string each time, to write a guaranteed-correct cluster file for fdbcli. These are more involved code changes, please advise.
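Asking the cluster itself for the authoritative connection string can be done by reading the `\xff\xff/connection_string` special key. A hedged sketch, assuming the FoundationDB Go bindings (`getConnectionString` is a made-up helper name, not operator code):

```go
package main

import (
	"fmt"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

// getConnectionString asks the connected cluster for its current connection
// string via the special key space, rather than trusting the on-disk file.
func getConnectionString(db fdb.Database) (string, error) {
	val, err := db.ReadTransact(func(tr fdb.ReadTransaction) (interface{}, error) {
		// \xff\xff/connection_string resolves to the cluster's current
		// connection string as known by the coordinated state.
		return tr.Get(fdb.Key("\xff\xff/connection_string")).Get()
	})
	if err != nil {
		return "", err
	}
	return string(val.([]byte)), nil
}

func main() {
	fdb.MustAPIVersion(710) // assumed API version
	db := fdb.MustOpenDefault()
	cs, err := getConnectionString(db)
	if err != nil {
		panic(err)
	}
	fmt.Println(cs)
}
```

The returned string could then be written to a fresh temporary file for each fdbcli invocation, making the on-disk file a pure output that nothing reads back.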
Testing
Unit tests pass. Still working on my local e2e test setup.
Does this change look reasonable? Then I can try it in our staging systems for a while.
Documentation
n/a
Follow-up
n/a