Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Double Close() of the *topo.Server leads to panics when starting vttablet #17093

Closed
frouioui opened this issue Oct 28, 2024 · 0 comments · Fixed by #17094
Closed

Double Close() of the *topo.Server leads to panics when starting vttablet #17093

frouioui opened this issue Oct 28, 2024 · 0 comments · Fixed by #17094

Comments

@frouioui
Copy link
Member

frouioui commented Oct 28, 2024

Overview of the Issue

When starting vttablet, we create a *topo.Server we have a defer to close this server but we also close the server when we return errors here and here. Which means we call ts.Close() twice, now if we look at the content of that function as seen below, we can see we call a method on both globalCell and globalReadOnlyCell which are set to nil soon after. Obviously this does not play out well when we call ts.Close() twice.

vitess/go/vt/topo/server.go

Lines 345 to 358 in 16b05c1

func (ts *Server) Close() {
ts.globalCell.Close()
if ts.globalReadOnlyCell != ts.globalCell {
ts.globalReadOnlyCell.Close()
}
ts.globalCell = nil
ts.globalReadOnlyCell = nil
ts.mu.Lock()
defer ts.mu.Unlock()
for _, cc := range ts.cellConns {
cc.conn.Close()
}
ts.cellConns = make(map[string]cellConn)
}

This issue can lead to the following panic, which was first observed in an upgrade-test of the vitess-operator:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x116e958]

goroutine 1 [running]:
vitess.io/vitess/go/vt/topo.(*Server).Close(0xc000149600)
        vitess.io/vitess/go/vt/topo/server.go:346 +0x38
vitess.io/vitess/go/cmd/vttablet/cli.run(0x45111e0, {0xc0000d6b40?, 0x7?, 0x2657355?})
        vitess.io/vitess/go/cmd/vttablet/cli/cli.go:176 +0x890
github.com/spf13/cobra.(*Command).execute(0x45111e0, {0xc00014a010, 0x1e, 0x1f})
        github.com/spf13/[email protected]/command.go:985 +0xaaa
github.com/spf13/cobra.(*Command).ExecuteC(0x45111e0)
        github.com/spf13/[email protected]/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0x0?)
        github.com/spf13/[email protected]/command.go:1041 +0x13
main.main()
        vitess.io/vitess/go/cmd/vttablet/vttablet.go:26 +0x1e

I can track down the apparition of the issue to #15928, in this PR we are changing the code from closing the *topo.Server in a servenv.OnClose(func() {}) function, to closing now it in a defer.

Reproduction Steps

This issue can be reproduced by running the upgrade test of vitess-operator locally, the Vitess image must be one that was built before this issue gets fixed.

$ cd vitess-operator
$ make upgrade-test
...

# in another terminal, observe the failure once we upgrade
$ kubectl get pods
NAME                                                         READY   STATUS             RESTARTS      AGE
example-commerce-x-x-zone1-vtorc-c13ef6ff-747599bc4d-mfpks   1/1     Running            0             14m
example-etcd-faf13de3-1                                      1/1     Running            0             18m
example-etcd-faf13de3-2                                      1/1     Running            0             18m
example-etcd-faf13de3-3                                      1/1     Running            0             18m
example-vttablet-zone1-0790125915-4e37d9d5                   3/3     Running            1 (15m ago)   18m
example-vttablet-zone1-2469782763-bfadd780                   2/3     CrashLoopBackOff   7 (58s ago)   14m
example-vttablet-zone1-2548885007-46a852d0                   3/3     Running            1 (16m ago)   18m
example-zone1-vtctld-1d4dcad0-5fbd746b48-vqk2j               1/1     Running            0             14m
example-zone1-vtgate-bc6cde92-695477d5b-s7jjl                1/1     Running            0             14m
vitess-operator-586755589d-d446c                             1/1     Running            0             14m

$ kubectl logs example-vttablet-zone1-2469782763-bfadd780
...
I1028 19:36:10.580694       1 tm_init.go:367] TabletManager Start took ~182 ms
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x116e958]

goroutine 1 [running]:
vitess.io/vitess/go/vt/topo.(*Server).Close(0xc000301280)
        vitess.io/vitess/go/vt/topo/server.go:346 +0x38
vitess.io/vitess/go/cmd/vttablet/cli.run(0x45111e0, {0xc00003cb40?, 0x7?, 0x2657355?})
        vitess.io/vitess/go/cmd/vttablet/cli/cli.go:176 +0x890
github.com/spf13/cobra.(*Command).execute(0x45111e0, {0xc000234010, 0x1e, 0x1f})
        github.com/spf13/[email protected]/command.go:985 +0xaaa
github.com/spf13/cobra.(*Command).ExecuteC(0x45111e0)
        github.com/spf13/[email protected]/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0x0?)
        github.com/spf13/[email protected]/command.go:1041 +0x13
main.main()
        vitess.io/vitess/go/cmd/vttablet/vttablet.go:26 +0x1e

Binary Version

>= v21.0.0-rc1

Operating System and Environment details

n/a

Log Fragments

No response

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant