Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PANIC: could not locate a valid checkpoint record #327

Closed
alexgleason opened this issue Sep 16, 2024 · 8 comments
Closed

PANIC: could not locate a valid checkpoint record #327

alexgleason opened this issue Sep 16, 2024 · 8 comments

Comments

@alexgleason
Copy link
Contributor

alexgleason commented Sep 16, 2024

Follow up to #321

On Deno v1.46.3, I am creating a pglite instance with file://data/pgdata (so nodefs). After closing the process and starting the existing database again, it seems the database is corrupted:

➜  ditto git:(main) deno task dev
Task dev deno run -A --watch src/server.ts
Watcher Process started.
Running migrations...
pglite: found DB, resuming
pgfs:ext OK /tmp/pglite/lib/postgresql/pg_trgm.so []
FIXME: int shmctl (int __shmid=666, int __cmd=2, struct shmid_ds *__buf=0x4611b78)
# FIXING: void *shmat (int __shmid=666, const void *__shmaddr=0, int __shmflg=0)
# FIXING: int shmget (key_t __key=40117774, size_t __size=40, int __shmflg=1920) pagesize default=65536
# FIXING: void *shmat (int __shmid=666, const void *__shmaddr=0, int __shmflg=0)
2024-09-16 14:44:37.702 GMT [42] LOG:  database system was interrupted; last known up at 2024-09-12 18:30:18 GMT
# 117(FATAL): insert_timeout(TimeoutId id=11, int index=0): /home/runner/work/pglite/pglite/postgresql-16.4/src/backend/utils/misc/timeout.c
2024-09-16 14:44:37.831 GMT [42] LOG:  invalid resource manager ID in checkpoint record
2024-09-16 14:44:37.831 GMT [42] PANIC:  could not locate a valid checkpoint record
Aborted()
RuntimeError: Aborted(). Build with -sASSERTIONS for more info.
    at abort (file:///home/alex/.cache/deno/npm/registry.npmjs.org/@soapbox.pub/pglite/0.2.10/dist/postgres.js:740:11)
    at __abort_js (file:///home/alex/.cache/deno/npm/registry.npmjs.org/@soapbox.pub/pglite/0.2.10/dist/postgres.js:7244:7)
    at abort (wasm://wasm/postgres.wasm-02856ac6:1:7189834)
    at errfinish (wasm://wasm/postgres.wasm-02856ac6:1:2120762)
    at InitWalRecovery (wasm://wasm/postgres.wasm-02856ac6:1:5426759)
    at StartupXLOG (wasm://wasm/postgres.wasm-02856ac6:1:5273250)
    at InitPostgres (wasm://wasm/postgres.wasm-02856ac6:1:625849)
    at AsyncPostgresSingleUserMain (wasm://wasm/postgres.wasm-02856ac6:1:2510013)
    at pg_initdb (wasm://wasm/postgres.wasm-02856ac6:1:3412565)
    at Object.Module._pg_initdb (file:///home/alex/.cache/deno/npm/registry.npmjs.org/@soapbox.pub/pglite/0.2.10/dist/postgres.js:10397:109)

(I am running a fork with #306, but I don't think this code change is related) EDIT: I am running pglite 0.2.8.

Background

If I start fresh with no pglite data on disk, I see this a ton of debug logs, including this error:

2024-09-16 14:49:03.618 GMT [42] FATAL:  could not open file "global/pg_filenode.map": No such file or directory
# 599: proc_exit(FATAL) ignored
2024-09-16 14:49:03.618 GMT [42] FATAL:  could not read file "global/pg_filenode.map": Bad file descriptor
# 599: proc_exit(FATAL) ignored
2024-09-16 14:49:03.618 GMT [42] WARNING:  fd passed to CloseTransientFile was not obtained from OpenTransientFile
2024-09-16 14:49:03.618 GMT [42] FATAL:  could not close file "global/pg_filenode.map": Bad file descriptor
# 599: proc_exit(FATAL) ignored
2024-09-16 14:49:03.618 GMT [42] FATAL:  relation mapping file "global/pg_filenode.map" contains invalid data
# 599: proc_exit(FATAL) ignored
2024-09-16 14:49:03.618 GMT [42] FATAL:  relation mapping file "global/pg_filenode.map" contains incorrect checksum
# 599: proc_exit(FATAL) ignored
# 549: PG_RE_THROW(ERROR : 0)
2024-09-16 14:49:03.618 GMT [42] FATAL:  could not find relation mapping for relation "pg_database", OID 1262
# 599: proc_exit(FATAL) ignored
TRAP: failed Assert("pg_re_throw tried to return"), File: "/home/runner/work/pglite/pglite/postgresql-16.4/src/backend/utils/error/elog.c", Line: 2032, PID: 42
Aborted()
error: Uncaught (in worker "") (in promise) RuntimeError: Aborted(). Build with -sASSERTIONS for more info.
    at abort (file:///home/alex/.cache/deno/npm/registry.npmjs.org/@soapbox.pub/pglite/0.2.10/dist/postgres.js:740:11)
    at __abort_js (file:///home/alex/.cache/deno/npm/registry.npmjs.org/@soapbox.pub/pglite/0.2.10/dist/postgres.js:7244:7)
    at abort (wasm://wasm/postgres.wasm-02856ac6:1:7189834)
    at ExceptionalCondition (wasm://wasm/postgres.wasm-02856ac6:1:2115639)
    at pg_re_throw (wasm://wasm/postgres.wasm-02856ac6:1:2129105)
    at errfinish (wasm://wasm/postgres.wasm-02856ac6:1:2120745)
    at RelationInitPhysicalAddr (wasm://wasm/postgres.wasm-02856ac6:1:566537)
    at formrdesc (wasm://wasm/postgres.wasm-02856ac6:1:578018)
    at RelationCacheInitializePhase2 (wasm://wasm/postgres.wasm-02856ac6:1:574931)
    at InitPostgres (wasm://wasm/postgres.wasm-02856ac6:1:625941)
2024-09-16 14:49:03.630 GMT [42] DEBUG:  rehashing catalog cache id 32 for pg_index; 129 tups, 64 buckets
# 338 cleanup(boot): /home/runner/work/pglite/pglite/postgresql-16.4/src/backend/bootstrap/bootstrap.c
# 108:fake shutdown
# skipped shmem_exit_index=5/6
# before_shmem_exit_index=4/6
2024-09-16 14:49:03.806 GMT [42] NOTICE:  shutting down
2024-09-16 14:49:03.806 GMT [42] DEBUG:  performing replication slot checkpoint
2024-09-16 14:49:03.829 GMT [42] DEBUG:  attempting to remove WAL segments older than log file 000000000000000000000000
2024-09-16 14:49:03.829 GMT [42] DEBUG:  SlruScanDirectory invoking callback on pg_subtrans/0000

after this, the deno process stays alive and continues for a couple seconds, until the http server is started then it finally crashes:

Listening on http://0.0.0.0:4036/
error: Uncaught (in promise) Error: Unhandled error in child worker.
    at Worker.#pollControl (ext:runtime/11_workers.js:204:19)
    at eventLoopTick (ext:core/01_core.js:175:7)
Watcher Process failed. Restarting on file change...

After this, if I quit and run again, I can use PGlite normally!

After this first set of errors, if I start deno task dev again, the whole thing works without any issues! That is, until I Ctrl+C and start again.

PANIC: could not locate a valid checkpoint record

Now after starting again, I get the first error.

What I tried

I tried adding a SIGINT listener in my application code to manually call pglite.close(), ensuring it's closed cleanly. This didn't help.

What I didn't try:

  • a different version of Deno
  • a different version of PGlite (I can't because this database relies a lot on json, and needs Fix array serialization #306)
@alexgleason
Copy link
Contributor Author

I bet #246 would help solve this problem.

The problem is the WAL getting corrupted. Maybe if I could change settings like synchronous_commit, fsync, wal_level etc it could help.

Although I am very confused why calling pglite.close() doesn't fully prevent this problem. Not sure if a Deno issue or workload issue.

@samwillis
Copy link
Collaborator

There was a bug in 0.2.7 where it wasn't checkpointing the WAL on close, 0.2.8 fixes that. Does it still happen?

@alexgleason
Copy link
Contributor Author

@samwillis Actually, it does seem like if I call pglite.close() the problem is avoided on 0.2.8. 👀

@alexgleason
Copy link
Contributor Author

Thank you for fixing this! I am able to use in-memory completely in my tests now, and getting close to being able to use the FS in local dev.

I am still encountering other bugs. I will close this and make a separate issue.

(Also, I am fighting with Deno to make it do the right thing on sigint, but that's Deno's fault.)

The docs should highlight the importance of calling .close() when using FS storage.

@alexgleason
Copy link
Contributor Author

Btw it does seem to be an issue with my workload, because I can't create a test that reproduces the problem, even with nodefs storage in Deno. I will try to do that before opening another issue.

@alexgleason
Copy link
Contributor Author

Oh boy, I figured it out! It's my Web Worker. I am creating a second PGlite instance inside of a Web Worker in Deno and that's making everything go crazy. If I delete that it works perfectly. Even sigint does not corrupt my database. 🤦 Maybe it's just a concurrency issue? Probably both processes touching the pglite database at the same time is what corrupted it in the first place.

@samwillis
Copy link
Collaborator

Yeh, two connections to the same datadir isn't supported and will do crazy things.

I'm hopeful we can support it in future, but there engineering (and possibly lobbying of browser devs) needed first.

@alexgleason
Copy link
Contributor Author

Btw, I just wanted to say I fixed the problem in my code. I was not meaning to run 2 instances of PGlite, just an accident. So once I sorted that out, PGlite works perfectly in Deno. 👍 Now I'm very happy. Thanks again for this awesome and brave project!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants