Restarting pod causes "panic: assertion failed: Page expected to be: 4190, but self identifies as 0" #28626

Open
WoodyWoodsta opened this issue Oct 8, 2024 · 2 comments

WoodyWoodsta commented Oct 8, 2024

Describe the bug
I have a 3-node Vault cluster using raft storage in Kubernetes. If I restart one of the pods, it fails immediately and continuously with the following error:

panic: assertion failed: Page expected to be: 4190, but self identifies as 0

goroutine 1 [running]:
github.com/hashicorp-forge/bbolt._assert(...)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/db.go:1387
github.com/hashicorp-forge/bbolt.(*page).fastCheck(0x79acc0b2e000, 0x105e)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/page.go:57 +0x1d9
github.com/hashicorp-forge/bbolt.(*Tx).page(0x79acbfc2a000?, 0x88b5d80?)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:534 +0x7b
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x4, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:546 +0x5d
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x3, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x2, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x1, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPage(...)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:542
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket(0xc00389a000, 0xc00339bf00, 0xc0037fe6a0, 0xc0037fe5e0, {0xcff58f0, 0x13361e40}, 0xc0033aa300)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx_check.go:83 +0x114
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket.func2({0x79acbfb54140?, 0xc0033a25a0?, 0xc003381108?})
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx_check.go:110 +0x90
github.com/hashicorp-forge/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc0037fe498)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/bucket.go:403 +0x96
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket(0xc00389a000, 0xc00389a018, 0xc0037fe6a0, 0xc0037fe5e0, {0xcff58f0, 0x13361e40}, 0xc0033aa300)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx_check.go:108 +0x255
github.com/hashicorp-forge/bbolt.(*DB).freepages(0xc003392908)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/db.go:1205 +0x225
github.com/hashicorp-forge/bbolt.(*DB).loadFreelist.func1()
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/db.go:417 +0xc5
sync.(*Once).doSlow(0x1dea4e0?, 0xc003392ad0?)
/opt/hostedtoolcache/go/1.22.7/x64/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
/opt/hostedtoolcache/go/1.22.7/x64/src/sync/once.go:65
github.com/hashicorp-forge/bbolt.(*DB).loadFreelist(0xc003392908?)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/db.go:413 +0x45
github.com/hashicorp-forge/bbolt.Open({0xc0033ba3a8, 0x14}, 0x180, 0xc0033ee9c0)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/db.go:295 +0x430
github.com/hashicorp/vault/physical/raft.(*FSM).openDBFile(0xc0033ea500, {0xc0033ba3a8, 0x14})
/home/runner/work/vault/vault/physical/raft/fsm.go:264 +0x266
github.com/hashicorp/vault/physical/raft.NewFSM({0xc003380241, 0xb}, {0xc0000ba133, 0x7}, {0xd096f48, 0xc003398c90})
/home/runner/work/vault/vault/physical/raft/fsm.go:218 +0x433
github.com/hashicorp/vault/physical/raft.NewRaftBackend(0xc003398a80, {0xd096f48, 0xc003398c60})
/home/runner/work/vault/vault/physical/raft/raft.go:439 +0xed
github.com/hashicorp/vault/command.(*ServerCommand).setupStorage(0xc003392008, 0xc0033da008)
/home/runner/work/vault/vault/command/server.go:811 +0x319
github.com/hashicorp/vault/command.(*ServerCommand).Run(0xc003392008, {0xc0000b4860, 0x1, 0x1})
/home/runner/work/vault/vault/command/server.go:1188 +0x10e6
github.com/hashicorp/cli.(*CLI).Run(0xc003814f00)
/home/runner/go/pkg/mod/github.com/hashicorp/[email protected]/cli.go:265 +0x5b8
github.com/hashicorp/vault/command.RunCustom({0xc0000b4850?, 0x2?, 0x2?}, 0xc0000061c0?)
/home/runner/work/vault/vault/command/main.go:243 +0x9a6
github.com/hashicorp/vault/command.Run(...)
/home/runner/work/vault/vault/command/main.go:147
main.main()
/home/runner/work/vault/vault/main.go:13 +0x47

The only way to recover right now is to completely remove the pod's persistent volume and restart it. This means it's impossible to update the Vault cluster without doing a full restore.
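
For context, the recovery currently looks roughly like the sketch below. This assumes the standard vault-helm StatefulSet naming (a data-vault-1 PVC and vault-1 pod in a vault namespace); the names will differ in other setups.

# Drop the corrupted persistent volume claim, then the pod; the StatefulSet
# controller recreates both, and the fresh node rejoins via retry_join and
# resyncs its raft data from the leader.
kubectl -n vault delete pvc data-vault-1
kubectl -n vault delete pod vault-1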

To Reproduce
Steps to reproduce the behavior:

  1. Run a 3 node cluster
  2. Restart one of the nodes

Expected behavior
Vault is able to recover from restarts.

Environment:

  • Vault Server Version (retrieve with vault status): 1.17.5
  • Vault CLI Version (retrieve with vault version): Vault v1.17.5 (4d0c53e), built 2024-08-30T15:54:57Z
  • Server Operating System/Architecture: Kubernetes, bare metal

Vault server configuration file(s):

disable_mlock = true
ui = true

listener "tcp" {
  tls_disable = 1
  address = "[::]:8200"
  cluster_address = "[::]:8201"

  # Enable unauthenticated metrics access (necessary for Prometheus Operator)
  telemetry {
    unauthenticated_metrics_access = "true"
  }
}

storage "raft" {
  path = "/vault/data"
  raft_wal = "true"
  raft_log_verifier_enabled = "true"

  retry_join {
    leader_api_addr = "http://vault-0.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-1.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-2.vault-internal:8200"
  }
}

service_registration "kubernetes" {}

telemetry {
  prometheus_retention_time = "30s"
  disable_hostname = true
}

Additional context

  • I understand the panic is coming from boltdb, but I'm wondering if Vault is doing anything specific that causes this issue.
  • I attempted to switch to raft_wal to fix the issue, but it looks as though boltdb is still used. The issue occurs regardless of the raft_wal setting.
heatherezell added the k8s, bug, and storage/raft labels on Oct 8, 2024
miagilepner (Contributor) commented:

It looks like your vault.db was somehow corrupted. It's hard to determine exactly what caused the corruption without the bbolt file.

I'd suggest using the bbolt command line utility (https://developer.hashicorp.com/vault/tutorials/monitoring/inspect-data-boltdb) to inspect the contents of a copy of the vault.db file, particularly page 4190, which is the page throwing this error. You could also consider performing a bbolt compact operation on the copy of the database, to see whether the freelist compaction is able to resolve the corruption.
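
For reference, a rough sketch of that inspection, assuming the upstream bbolt CLI is installed (go install go.etcd.io/bbolt/cmd/bbolt@latest) and the /vault/data path from the config above; the panic comes from the hashicorp-forge fork, but the file format should be the same. Always work on a copy rather than the live file:

# Copy the database out of the data directory first.
cp /vault/data/vault.db /tmp/vault-copy.db

# Run the built-in consistency check.
bbolt check /tmp/vault-copy.db

# Dump the page named in the panic (4190).
bbolt page /tmp/vault-copy.db 4190

# Rewrite the database into a new file, rebuilding the freelist along the way.
bbolt compact -o /tmp/vault-compacted.db /tmp/vault-copy.db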

miagilepner removed the bug label on Oct 23, 2024
WoodyWoodsta (Author) commented:

Unfortunately I have resorted to switching the storage backend to Postgres, so I won't be able to try those commands. The raft setup was too unstable for our production requirements.

What seemed odd to me is that it occurred on every pod restart.
