Restarting pod causes "panic: assertion failed: Page expected to be: 4190, but self identifies as 0" #28626

Open
WoodyWoodsta opened this issue Oct 8, 2024 · 2 comments

WoodyWoodsta commented Oct 8, 2024

Describe the bug
I have a 3-node Vault cluster using raft storage in Kubernetes. If I restart one of the pods, it fails immediately and continuously with the following error:

panic: assertion failed: Page expected to be: 4190, but self identifies as 0

goroutine 1 [running]:
github.com/hashicorp-forge/bbolt._assert(...)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/db.go:1387
github.com/hashicorp-forge/bbolt.(*page).fastCheck(0x79acc0b2e000, 0x105e)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/page.go:57 +0x1d9
github.com/hashicorp-forge/bbolt.(*Tx).page(0x79acbfc2a000?, 0x88b5d80?)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:534 +0x7b
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x4, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:546 +0x5d
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x3, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x2, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPageInternal(0xc00389a000, {0xc0033a25f0, 0x1, 0xa}, 0xc0037fe298)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:555 +0xc8
github.com/hashicorp-forge/bbolt.(*Tx).forEachPage(...)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx.go:542
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket(0xc00389a000, 0xc00339bf00, 0xc0037fe6a0, 0xc0037fe5e0, {0xcff58f0, 0x13361e40}, 0xc0033aa300)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx_check.go:83 +0x114
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket.func2({0x79acbfb54140?, 0xc0033a25a0?, 0xc003381108?})
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx_check.go:110 +0x90
github.com/hashicorp-forge/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc0037fe498)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/bucket.go:403 +0x96
github.com/hashicorp-forge/bbolt.(*Tx).checkBucket(0xc00389a000, 0xc00389a018, 0xc0037fe6a0, 0xc0037fe5e0, {0xcff58f0, 0x13361e40}, 0xc0033aa300)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/tx_check.go:108 +0x255
github.com/hashicorp-forge/bbolt.(*DB).freepages(0xc003392908)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/db.go:1205 +0x225
github.com/hashicorp-forge/bbolt.(*DB).loadFreelist.func1()
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/db.go:417 +0xc5
sync.(*Once).doSlow(0x1dea4e0?, 0xc003392ad0?)
/opt/hostedtoolcache/go/1.22.7/x64/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
/opt/hostedtoolcache/go/1.22.7/x64/src/sync/once.go:65
github.com/hashicorp-forge/bbolt.(*DB).loadFreelist(0xc003392908?)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/db.go:413 +0x45
github.com/hashicorp-forge/bbolt.Open({0xc0033ba3a8, 0x14}, 0x180, 0xc0033ee9c0)
/home/runner/go/pkg/mod/github.com/hashicorp-forge/[email protected]/db.go:295 +0x430
github.com/hashicorp/vault/physical/raft.(*FSM).openDBFile(0xc0033ea500, {0xc0033ba3a8, 0x14})
/home/runner/work/vault/vault/physical/raft/fsm.go:264 +0x266
github.com/hashicorp/vault/physical/raft.NewFSM({0xc003380241, 0xb}, {0xc0000ba133, 0x7}, {0xd096f48, 0xc003398c90})
/home/runner/work/vault/vault/physical/raft/fsm.go:218 +0x433
github.com/hashicorp/vault/physical/raft.NewRaftBackend(0xc003398a80, {0xd096f48, 0xc003398c60})
/home/runner/work/vault/vault/physical/raft/raft.go:439 +0xed
github.com/hashicorp/vault/command.(*ServerCommand).setupStorage(0xc003392008, 0xc0033da008)
/home/runner/work/vault/vault/command/server.go:811 +0x319
github.com/hashicorp/vault/command.(*ServerCommand).Run(0xc003392008, {0xc0000b4860, 0x1, 0x1})
/home/runner/work/vault/vault/command/server.go:1188 +0x10e6
github.com/hashicorp/cli.(*CLI).Run(0xc003814f00)
/home/runner/go/pkg/mod/github.com/hashicorp/[email protected]/cli.go:265 +0x5b8
github.com/hashicorp/vault/command.RunCustom({0xc0000b4850?, 0x2?, 0x2?}, 0xc0000061c0?)
/home/runner/work/vault/vault/command/main.go:243 +0x9a6
github.com/hashicorp/vault/command.Run(...)
/home/runner/work/vault/vault/command/main.go:147
main.main()
/home/runner/work/vault/vault/main.go:13 +0x47

The only way to recover right now is to completely remove the pod's persistent volume and restart it. This means it's impossible to update the Vault cluster without doing a full restore.
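
For context, the recovery currently looks roughly like the sketch below. This assumes the standard vault-helm StatefulSet naming (a data-vault-1 PVC and vault-1 pod in a vault namespace); the names will differ in other setups.

# Drop the corrupted persistent volume claim, then the pod; the StatefulSet
# controller recreates both, and the fresh node rejoins via retry_join and
# resyncs its raft data from the leader.
kubectl -n vault delete pvc data-vault-1
kubectl -n vault delete pod vault-1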

To Reproduce
Steps to reproduce the behavior:

  1. Run a 3 node cluster
  2. Restart one of the nodes

Expected behavior
Vault is able to recover from restarts.

Environment:

  • Vault Server Version (retrieve with vault status): 1.17.5
  • Vault CLI Version (retrieve with vault version): Vault v1.17.5 (4d0c53e), built 2024-08-30T15:54:57Z
  • Server Operating System/Architecture: Kubernetes, bare metal

Vault server configuration file(s):

disable_mlock = true
ui = true

listener "tcp" {
  tls_disable = 1
  address = "[::]:8200"
  cluster_address = "[::]:8201"

  # Enable unauthenticated metrics access (necessary for Prometheus Operator)
  telemetry {
    unauthenticated_metrics_access = "true"
  }
}

storage "raft" {
  path = "/vault/data"
  raft_wal = "true"
  raft_log_verifier_enabled = "true"

  retry_join {
    leader_api_addr = "http://vault-0.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-1.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-2.vault-internal:8200"
  }
}

service_registration "kubernetes" {}

telemetry {
  prometheus_retention_time = "30s"
  disable_hostname = true
}

Additional context

  • I understand the panic is coming from boltdb, but I'm wondering if Vault is doing anything specific that causes this issue.
  • I attempted to switch to raft_wal to fix the issue, but it looks as though boltdb is still used. The issue occurs regardless of the raft_wal setting.
heatherezell added the k8s, bug, and storage/raft labels on Oct 8, 2024
miagilepner (Contributor) commented:

It looks like your vault.db was somehow corrupted. It's hard to determine exactly what caused the corruption without the bbolt file.

I'd suggest using the bbolt command line utility (https://developer.hashicorp.com/vault/tutorials/monitoring/inspect-data-boltdb) to inspect the contents of a copy of the vault.db file, particularly page 4190, which is the page throwing this error. You could also consider performing a bbolt compact operation on the copy of the database, to see whether the freelist compaction is able to resolve the corruption.
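
For reference, a rough sketch of that inspection, assuming the upstream bbolt CLI is installed (go install go.etcd.io/bbolt/cmd/bbolt@latest) and the /vault/data path from the config above; the panic comes from the hashicorp-forge fork, but the file format should be the same. Always work on a copy rather than the live file:

# Copy the database out of the data directory first.
cp /vault/data/vault.db /tmp/vault-copy.db

# Run the built-in consistency check.
bbolt check /tmp/vault-copy.db

# Dump the page named in the panic (4190).
bbolt page /tmp/vault-copy.db 4190

# Rewrite the database into a new file, rebuilding the freelist along the way.
bbolt compact -o /tmp/vault-compacted.db /tmp/vault-copy.db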

miagilepner removed the bug label on Oct 23, 2024
WoodyWoodsta (Author) commented:

Unfortunately I have resorted to switching the storage backend to Postgres, so I won't be able to try those commands. The raft setup was too unstable for our production requirements.

What seemed odd to me is that it occurred on every pod restart.
