Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add check and reconnect method to Database class #466

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

dale-wahl
Copy link
Member

This would allow the Database class to reconnect on SSL errors or recover if the connection was closed. If we run a bad query it will not recover from psycopg2.errors.InFailedSqlTransaction: current transaction is aborted, commands ignored until end of transaction block and require a rollback.

I tested it by shutting down the database, attempting to run some processes (which obviously failed), and then restarting the database. It seemed to recover just fine (well, any processors that failed would not restart until we restart the backend, but the frontend recovered fine).

@dale-wahl
Copy link
Member Author

Hmmm. I thought about this some more last night and am not sure this is the best way forward.

Issue:
On a database disconnect both the backend and frontend crash and cannot recover automatically even if the database/connection comes back online. They must be restarted.

Dislike of this fix:
Creates another query (admittedly quick: SELECT 1;) for every db call. This was easy to implement, but it could be better.

Desire:
4CAT waits for a database connection to be re-established.

The backend manager currently dies. At minimum, if this was captured and waited, 4CAT would generally be fine. All currently running workers also crash including the API (the only always running worker) and there is no mechanism to restart them except via a restart (perhaps that could be changed).

The frontend cannot re-establish a connection. It does not actually crash, but even when the database is back up and could accept new connections, the frontend has no way of doing this. The error changes from unable to connect to a stale connection ad infinium and needs to be restarted.

Possible better solutions:
For the frontend, we could wrap to detect the stale connection and attempt reconnect. If it cannot, it will fail again, but will be able to recover whenever the database is reachable.

Backend... We could add to the manager to handle that disconnect. But maybe there is a better want to have any worker with a db disconnect wait. It could also check for self.interrupted to exit if needed.

Logging may need to be reviewed.

@dale-wahl dale-wahl removed the request for review from stijn-uva November 27, 2024 08:56
@dale-wahl dale-wahl marked this pull request as draft November 27, 2024 08:56
@dale-wahl
Copy link
Member Author

dale-wahl commented Dec 3, 2024

Some things I learned:

  • psycopg2 is being maintained, but not updated. Psycopg 3 will be an update to psycopg.
  • in fancy new psycopg (3), connections will have a check_connection method
  • connection pools would be useful in our case, but do not in and of themselves solve the disconnect issue (as current implementation in psycopg2 does not check the connection before providing it from the pool).
  • connection.closed only updates after an attempt to connect fails
  • some places recommend checking connection via connection.isolation_level, but this seems to no long be the case (our version returns None always)
  • creating a cursor from a connection does not check the connection (though it will fail if the connection is already closed)

If we desire, we could use a connection pool and, instead of creating a new connection with every worker, share the Database class and then use a connection pool in, for example, the get_cursor function. Not sure how much benefit we would see from that; it may be more useful in the frontend as, if I understand correctly, a single worker can handle multiple requests at the same time and thus would be using a single connection.

@dale-wahl
Copy link
Member Author

Refactored so instead of testing a query, it reconnects on any failed query. I tested disconnecting the database for both front and backend and believe I worked through any issues.

@dale-wahl dale-wahl marked this pull request as ready for review December 4, 2024 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant