
extend timeout for Hasura migrations at boot #4080

Merged: 2 commits into main on Dec 13, 2024

Conversation

@freemvmt (Contributor) commented Dec 13, 2024

Related to this ticket.

Neither of the interventions in #4051 or #4076 resolved the Hasura boot loop issue on scale up.

This PR takes a different tack, based on a closer reading of the logs for the instances which are failing to boot (all of which follow a similar pattern). Essentially it allows the Hasura server 5 minutes to pull metadata and handle migrations from the RDS-hosted PostgreSQL database.
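Concretely, that means raising `HASURA_GRAPHQL_MIGRATIONS_SERVER_TIMEOUT` from its 30-second default (as hinted at in the log below) to 300. A minimal sketch of the intent, expressed here as a Python mapping purely for illustration — the actual config format in our deployment differs:

```python
# Illustrative only: the env var the Hasura container reads at boot.
# 300 seconds = 5 minutes, up from the 30-second default noted in the logs.
hasura_environment = {
    "HASURA_GRAPHQL_MIGRATIONS_SERVER_TIMEOUT": "300",
}
```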

(includes bits for gitignore to ease my workflow with python)

See also #4081.

Context

After a closer look at the logs (example), I realised that the key might be in the final log from the hasura container:

{"timestamp":"2024-12-12T17:06:11.000+0000","level":"info","type":"startup","detail":{"kind":"migrations-startup","info":"failed waiting for 9691, try increasing HASURA_GRAPHQL_MIGRATIONS_SERVER_TIMEOUT (default: 30)"}}

That is, the Hasura server itself is failing to load the metadata which it requires for setup from the upstream database. So I’m drawn to the conclusion that it’s the underlying Postgres database (hosted on AWS RDS) which is the root of the issue.

Indeed, the staging instance (a T3 Micro, compared to a Medium on prod) will have a maximum of 104 or so connections with its 1 GB of memory (i.e. 1 × 10⁹ / 9,531,392 ≈ 104; see the AWS docs on RDS instance quotas and instance hardware specs).
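As a sanity check, here is that estimate worked through in Python, assuming RDS's documented default formula for PostgreSQL of `LEAST(DBInstanceClassMemory / 9531392, 5000)` and treating the instance memory as a round 1 GB:

```python
# Rough estimate of the default max_connections RDS assigns a PostgreSQL instance.
# Formula per the AWS docs: LEAST(DBInstanceClassMemory / 9531392, 5000).
memory_bytes = 1_000_000_000  # ~1 GB on the staging T3 Micro (approximation)
max_connections = min(memory_bytes // 9_531_392, 5000)
print(max_connections)  # -> 104
```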

In practice the number of connections it can handle is lower, because some memory must be reserved for running the database itself; e.g. the following screen grab suggests it tops out at ~78:

[screen grab: database connections topping out at ~78]

Hasura servers default to 1 connection pool with 1 'stripe' of at most 50 connections, so it might be that attempting to boot a 2nd server is simply overwhelming our database.

There are a few possible ways forward here (see Notion doc), of which this PR is the first and simplest.

@freemvmt freemvmt requested a review from a team December 13, 2024 15:38

github-actions bot commented Dec 13, 2024

Removed vultr server and associated DNS entries

@DafyddLlyr (Contributor):

Thanks for the super clear PR description - always appreciated!

@freemvmt (Contributor, Author):

> Thanks for the super clear PR description - always appreciated!

Thanks! I just try to imagine being that dev trying to make sense of it 2yrs later 😅

@freemvmt merged commit 3c0358f into main on Dec 13, 2024
12 checks passed
@freemvmt deleted the extend-hasura-boot-timeout branch on Dec 13, 2024 at 16:25