extend timeout for Hasura migrations at boot #4080
Merged
Related to this ticket.
Neither of the interventions (#4051, #4076) resolved the Hasura boot-loop issue on scale-up.
This PR takes a different tack, based on a closer reading of the logs for the instances which are failing to boot (all of which follow a similar pattern). Essentially, it allows the Hasura server five minutes to pull metadata and handle migrations from the RDS-hosted PostgreSQL database.
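Not the actual diff, but a minimal Python sketch of the behaviour this buys us: poll Hasura's /healthz endpoint for up to five minutes before treating the boot as failed. The localhost URL and the 5-second poll interval are assumptions for illustration; the real change lives in the deployment/healthcheck configuration.

```python
"""Rough sketch only: give Hasura up to 5 minutes to finish pulling
metadata / running migrations before treating the boot as failed."""
import time
import urllib.error
import urllib.request

HASURA_HEALTH_URL = "http://localhost:8080/healthz"  # assumed local endpoint
BOOT_TIMEOUT_SECONDS = 300  # the "5 minutes" this PR allows
POLL_INTERVAL_SECONDS = 5   # assumed polling cadence


def wait_for_hasura() -> bool:
    """Poll the health endpoint until it returns 200 or the budget runs out."""
    deadline = time.monotonic() + BOOT_TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(HASURA_HEALTH_URL, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (or still applying migrations)
        time.sleep(POLL_INTERVAL_SECONDS)
    return False


if __name__ == "__main__":
    raise SystemExit(0 if wait_for_hasura() else 1)
```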
(Also includes some .gitignore additions to ease my Python workflow.)
See also #4081.
Context
After a closer look at the logs (example), I realised that the key might be in the final log from the hasura container: the Hasura server itself is failing to load the metadata which it requires for setup from the upstream database. So I'm drawn to the conclusion that it's the underlying Postgres database (hosted on AWS RDS) which is the root of the issue.
Indeed, the staging instance (a T3 Micro, compared to a Medium on prod) will have a maximum of 104 or so connections with its 1 GB of memory (i.e. 1x10^9 / 9531392 - see the AWS docs on RDS instance quotas and instance hardware specs). In practice the number of connections it can handle is lower, because some memory must be reserved for running the database itself; e.g. a grab from the instance suggests it tops out at ~78.
Hasura servers default to a single pool with one 'stripe' of at most 50 connections, so it might be that attempting to boot a second server is simply overwhelming our database.
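To make the connection arithmetic concrete, here's a quick back-of-the-envelope sketch. It uses the same 1 GB ≈ 10^9 bytes approximation as above, and the RDS default for PostgreSQL of max_connections = LEAST(DBInstanceClassMemory / 9531392, 5000); the prod figure assumes the T3 Medium's nominal 4 GB.

```python
# Back-of-the-envelope: RDS PostgreSQL defaults max_connections to
# LEAST(DBInstanceClassMemory / 9531392, 5000).
def rds_max_connections(memory_bytes: int) -> int:
    return min(memory_bytes // 9531392, 5000)

t3_micro = rds_max_connections(1 * 10**9)   # staging: ~104 (observed ceiling ~78)
t3_medium = rds_max_connections(4 * 10**9)  # prod: ~419

# Two Hasura servers at the default pool size of 50 connections each
# would want up to 100 connections -- more than staging can actually serve.
hasura_default_pool = 50
demand_with_two_servers = 2 * hasura_default_pool

print(t3_micro, t3_medium, demand_with_two_servers)  # 104 419 100
```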
There are a few possible ways forward here (see Notion doc), of which this PR is the first and simplest.