extend timeout for Hasura migrations at boot #4080
Merged
Related to this ticket.
Neither of the interventions (#4051, #4076) resolved the Hasura boot-loop issue on scale-up.
This PR takes a different tack, based on a closer reading of the logs for the instances which are failing to boot (all of which follow a similar pattern). Essentially, it allows the Hasura server five minutes to pull metadata and handle migrations from the RDS-hosted PostgreSQL database.
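Not the actual diff, but a minimal Python sketch of the behaviour this buys us: poll Hasura's /healthz endpoint for up to five minutes before treating the boot as failed. The localhost URL and the 5-second poll interval are assumptions for illustration; the real change lives in the deployment/healthcheck configuration.

```python
"""Rough sketch only: give Hasura up to 5 minutes to finish pulling
metadata / running migrations before treating the boot as failed."""
import time
import urllib.error
import urllib.request

HASURA_HEALTH_URL = "http://localhost:8080/healthz"  # assumed local endpoint
BOOT_TIMEOUT_SECONDS = 300  # the "5 minutes" this PR allows
POLL_INTERVAL_SECONDS = 5   # assumed polling cadence


def wait_for_hasura() -> bool:
    """Poll the health endpoint until it returns 200 or the budget runs out."""
    deadline = time.monotonic() + BOOT_TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(HASURA_HEALTH_URL, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (or still applying migrations)
        time.sleep(POLL_INTERVAL_SECONDS)
    return False


if __name__ == "__main__":
    raise SystemExit(0 if wait_for_hasura() else 1)
```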
(Also includes some .gitignore additions to ease my Python workflow.)
See also #4081.
Context
After a closer look at the logs (example), I realised that the key might be in the final log from the hasura container: the Hasura server itself is failing to load the metadata which it requires for setup from the upstream database. So I'm drawn to the conclusion that it's the underlying Postgres database (hosted on AWS RDS) which is the root of the issue.
Indeed, the staging instance (a T3 Micro, compared to a Medium on prod) will have a maximum of 104 or so connections with its 1 GB of memory (i.e. 1x10^9 / 9531392 - see the AWS docs on RDS instance quotas and instance hardware specs). In practice the number of connections it can handle is lower, because some memory must be reserved for running the database itself; e.g. a grab from the instance suggests it tops out at ~78.
Hasura servers default to a single pool with one 'stripe' of at most 50 connections, so it might be that attempting to boot a second server is simply overwhelming our database.
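To make the connection arithmetic concrete, here's a quick back-of-the-envelope sketch. It uses the same 1 GB ≈ 10^9 bytes approximation as above, and the RDS default for PostgreSQL of max_connections = LEAST(DBInstanceClassMemory / 9531392, 5000); the prod figure assumes the T3 Medium's nominal 4 GB.

```python
# Back-of-the-envelope: RDS PostgreSQL defaults max_connections to
# LEAST(DBInstanceClassMemory / 9531392, 5000).
def rds_max_connections(memory_bytes: int) -> int:
    return min(memory_bytes // 9531392, 5000)

t3_micro = rds_max_connections(1 * 10**9)   # staging: ~104 (observed ceiling ~78)
t3_medium = rds_max_connections(4 * 10**9)  # prod: ~419

# Two Hasura servers at the default pool size of 50 connections each
# would want up to 100 connections -- more than staging can actually serve.
hasura_default_pool = 50
demand_with_two_servers = 2 * hasura_default_pool

print(t3_micro, t3_medium, demand_with_two_servers)  # 104 419 100
```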
There are a few possible ways forward here (see Notion doc), of which this PR is the first and simplest.