Skip to content

Commit

Permalink
metrics: change vtbackup_duration_by_phase to binary-valued vtbackup_…
Browse files Browse the repository at this point in the history
…phase (#12973)

Signed-off-by: Max Englander <[email protected]>
  • Loading branch information
maxenglander authored Sep 20, 2023
1 parent adac810 commit ebbd418
Show file tree
Hide file tree
Showing 21 changed files with 845 additions and 216 deletions.
26 changes: 24 additions & 2 deletions changelog/18.0/18.0.0/summary.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,10 @@
- [Deleted `V3` planner](#deleted-v3)
- [Deleted `k8stopo`](#deleted-k8stopo)
- [Deleted `vtgr`](#deleted-vtgr)
- [Deprecated VTBackup stat `DurationByPhase`](#deprecated-vtbackup-stat-duration-by-phase)
- **[New stats](#new-stats)**
- [VTGate Vindex unknown parameters](#vtgate-vindex-unknown-parameters)
- [VTBackup stat `Phase`](#vtbackup-stat-phase)
- [VTBackup stat `PhaseStatus`](#vtbackup-stat-phase-status)
- **[VTTablet](#vttablet)**
- [VTTablet: New ResetSequences RPC](#vttablet-new-rpc-reset-sequences)
Expand Down Expand Up @@ -115,17 +117,37 @@ the `k8stopo` has been removed.

The `vtgr` has been deprecated in Vitess 17, also see https://github.com/vitessio/vitess/issues/13300. With Vitess 18 `vtgr` has been removed.

#### <a id="deprecated-vtbackup-stat-duration-by-phase"/>Deprecated VTbackup stat `DurationByPhase`

VTBackup stat `DurationByPhase` is deprecated. Use the binary-valued `Phase` stat instead.

### <a id="new-stats"/>New stats

#### <a id="vtgate-vindex-unknown-parameters"/>VTGate Vindex unknown parameters

The VTGate stat `VindexUnknownParameters` gauges unknown Vindex parameters found in the latest VSchema pulled from the topology.

#### <a id="vtbackup-stat-phase"/>VTBackup `Phase` stat

In v17, the `vtbackup` stat `DurationByPhase` stat was added measuring the time spent by `vtbackup` in each phase. This stat turned out to be awkward to use in production, and has been replaced in v18 by a binary-valued `Phase` stat.

`Phase` reports a 1 (active) or a 0 (inactive) for each of the following phases:

* `CatchupReplication`
* `InitialBackup`
* `RestoreLastBackup`
* `TakeNewBackup`

To calculate how long `vtbackup` has spent in a given phase, sum the 1-valued data points over time and multiply by the data collection or reporting interval. For example, in Prometheus:

```
sum_over_time(vtbackup_phase{phase="TakeNewBackup"}) * <interval>
```
#### <a id="vtbackup-stat-phase-status"/>VTBackup `PhaseStatus` stat

`PhaseStatus` reports a 1 (active) or a 0 (inactive) for each of the following phases and statuses:

* `CatchUpReplication` phase has statuses `Stalled` and `Stopped`.
* `CatchupReplication` phase has statuses `Stalled` and `Stopped`.
* `Stalled` is set to `1` when replication stops advancing.
* `Stopped` is set to `1` when replication stops before `vtbackup` catches up with the primary.

Expand Down Expand Up @@ -165,4 +187,4 @@ removing Vitess support.

#### <a id="new-durability-policies"/>New Durability Policies

2 new inbuilt durability policies have been added to Vitess in this release namely `semi_sync_with_rdonly_ack` and `cross_cell_with_rdonly_ack`. These policies are exactly like `semi_sync` and `cross_cell` respectively, and differ just in the part where the rdonly tablets can also send semi-sync ACKs.
2 new inbuilt durability policies have been added to Vitess in this release namely `semi_sync_with_rdonly_ack` and `cross_cell_with_rdonly_ack`. These policies are exactly like `semi_sync` and `cross_cell` respectively, and differ just in the part where the rdonly tablets can also send semi-sync ACKs.
25 changes: 25 additions & 0 deletions go/cmd/vtbackup/plugin_opentsdb.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
/*
Copyright 2023 The Vitess Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package main

import "vitess.io/vitess/go/stats/opentsdb"

// This plugin imports opentsdb to register the opentsdb stats backend.

func init() {
opentsdb.Init("vtbackup")
}
83 changes: 63 additions & 20 deletions go/cmd/vtbackup/vtbackup.go
Original file line number Diff line number Diff line change
Expand Up @@ -100,9 +100,12 @@ const (
// forever for things that should be quick.
operationTimeout = 1 * time.Minute

phaseNameCatchUpReplication = "CatchUpReplication"
phaseStatusCatchUpReplicationStalled = "Stalled"
phaseStatusCatchUpReplicationStopped = "Stopped"
phaseNameCatchupReplication = "CatchupReplication"
phaseNameInitialBackup = "InitialBackup"
phaseNameRestoreLastBackup = "RestoreLastBackup"
phaseNameTakeNewBackup = "TakeNewBackup"
phaseStatusCatchupReplicationStalled = "Stalled"
phaseStatusCatchupReplicationStopped = "Stopped"
)

var (
Expand All @@ -127,20 +130,44 @@ var (
detachedMode bool
keepAliveTimeout = 0 * time.Second
disableRedoLog = false
durationByPhase = stats.NewGaugesWithSingleLabel(

// Deprecated, use "Phase" instead.
deprecatedDurationByPhase = stats.NewGaugesWithSingleLabel(
"DurationByPhaseSeconds",
"How long it took vtbackup to perform each phase (in seconds).",
"[DEPRECATED] How long it took vtbackup to perform each phase (in seconds).",
"phase",
)

// This gauge is updated 3*N times during the course of a vtbackup run,
// where N is the number of different phases vtbackup transitions through.
// Once to initialize to 0, another time to set the phase to active (1),
// and another to deactivate the phase (back to 0).
//
// At most a single phase is active at a given time.
//
// The sync gauge immediately reports changes to push-backed backends.
// The benefit of the sync gauge is that it makes verifying stats in
// integration tests a lot more tractable.
phase = stats.NewSyncGaugesWithSingleLabel(
"Phase",
"Active phase.",
"phase",
)
phaseNames = []string{
phaseNameCatchupReplication,
phaseNameInitialBackup,
phaseNameRestoreLastBackup,
phaseNameTakeNewBackup,
}
phaseStatus = stats.NewGaugesWithMultiLabels(
"PhaseStatus",
"Internal state of vtbackup phase.",
[]string{"phase", "status"},
)
phaseStatuses = map[string][]string{
phaseNameCatchUpReplication: {
phaseStatusCatchUpReplicationStalled,
phaseStatusCatchUpReplicationStopped,
phaseNameCatchupReplication: {
phaseStatusCatchupReplicationStalled,
phaseStatusCatchupReplicationStopped,
},
}
)
Expand Down Expand Up @@ -219,6 +246,9 @@ func main() {
defer topoServer.Close()

// Initialize stats.
for _, phaseName := range phaseNames {
phase.Set(phaseName, int64(0))
}
for phaseName, statuses := range phaseStatuses {
for _, status := range statuses {
phaseStatus.Set([]string{phaseName, status}, 0)
Expand Down Expand Up @@ -294,7 +324,7 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
if err := mysqld.Init(initCtx, mycnf, initDBSQLFile); err != nil {
return fmt.Errorf("failed to initialize mysql data dir and start mysqld: %v", err)
}
durationByPhase.Set("InitMySQLd", int64(time.Since(initMysqldAt).Seconds()))
deprecatedDurationByPhase.Set("InitMySQLd", int64(time.Since(initMysqldAt).Seconds()))
// Shut down mysqld when we're done.
defer func() {
// Be careful not to use the original context, because we don't want to
Expand Down Expand Up @@ -358,14 +388,19 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back

backupParams.BackupTime = time.Now()
// Now we're ready to take the backup.
phase.Set(phaseNameInitialBackup, int64(1))
defer phase.Set(phaseNameInitialBackup, int64(0))
if err := mysqlctl.Backup(ctx, backupParams); err != nil {
return fmt.Errorf("backup failed: %v", err)
}
durationByPhase.Set("InitialBackup", int64(time.Since(backupParams.BackupTime).Seconds()))
deprecatedDurationByPhase.Set("InitialBackup", int64(time.Since(backupParams.BackupTime).Seconds()))
log.Info("Initial backup successful.")
phase.Set(phaseNameInitialBackup, int64(0))
return nil
}

phase.Set(phaseNameRestoreLastBackup, int64(1))
defer phase.Set(phaseNameRestoreLastBackup, int64(0))
backupDir := mysqlctl.GetBackupDir(initKeyspace, initShard)
log.Infof("Restoring latest backup from directory %v", backupDir)
restoreAt := time.Now()
Expand Down Expand Up @@ -397,7 +432,8 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
default:
return fmt.Errorf("can't restore from backup: %v", err)
}
durationByPhase.Set("RestoreLastBackup", int64(time.Since(restoreAt).Seconds()))
deprecatedDurationByPhase.Set("RestoreLastBackup", int64(time.Since(restoreAt).Seconds()))
phase.Set(phaseNameRestoreLastBackup, int64(0))

// As of MySQL 8.0.21, you can disable redo logging using the ALTER INSTANCE
// DISABLE INNODB REDO_LOG statement. This functionality is intended for
Expand Down Expand Up @@ -455,6 +491,9 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
backupParams.BackupTime = time.Now()

// Wait for replication to catch up.
phase.Set(phaseNameCatchupReplication, int64(1))
defer phase.Set(phaseNameCatchupReplication, int64(0))

var (
lastStatus replication.ReplicationStatus
status replication.ReplicationStatus
Expand All @@ -479,26 +518,27 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
// We're caught up on replication to at least the point the primary
// was at when this vtbackup run started.
log.Infof("Replication caught up to %v after %v", status.Position, time.Since(waitStartTime))
durationByPhase.Set("CatchUpReplication", int64(time.Since(waitStartTime).Seconds()))
deprecatedDurationByPhase.Set("CatchUpReplication", int64(time.Since(waitStartTime).Seconds()))
break
}
if !lastStatus.Position.IsZero() {
if status.Position.Equal(lastStatus.Position) {
phaseStatus.Set([]string{phaseNameCatchUpReplication, phaseStatusCatchUpReplicationStalled}, 1)
phaseStatus.Set([]string{phaseNameCatchupReplication, phaseStatusCatchupReplicationStalled}, 1)
} else {
phaseStatus.Set([]string{phaseNameCatchUpReplication, phaseStatusCatchUpReplicationStalled}, 0)
phaseStatus.Set([]string{phaseNameCatchupReplication, phaseStatusCatchupReplicationStalled}, 0)
}
}
if !status.Healthy() {
log.Warning("Replication has stopped before backup could be taken. Trying to restart replication.")
phaseStatus.Set([]string{phaseNameCatchUpReplication, phaseStatusCatchUpReplicationStopped}, 1)
phaseStatus.Set([]string{phaseNameCatchupReplication, phaseStatusCatchupReplicationStopped}, 1)
if err := startReplication(ctx, mysqld, topoServer); err != nil {
log.Warningf("Failed to restart replication: %v", err)
}
} else {
phaseStatus.Set([]string{phaseNameCatchUpReplication, phaseStatusCatchUpReplicationStopped}, 0)
phaseStatus.Set([]string{phaseNameCatchupReplication, phaseStatusCatchupReplicationStopped}, 0)
}
}
phase.Set(phaseNameCatchupReplication, int64(0))

// Stop replication and see where we are.
if err := mysqld.StopReplication(nil); err != nil {
Expand All @@ -514,8 +554,8 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
if !status.Position.AtLeast(primaryPos) && status.Position.Equal(restorePos) {
return fmt.Errorf("not taking backup: replication did not make any progress from restore point: %v", restorePos)
}
phaseStatus.Set([]string{phaseNameCatchUpReplication, phaseStatusCatchUpReplicationStalled}, 0)
phaseStatus.Set([]string{phaseNameCatchUpReplication, phaseStatusCatchUpReplicationStopped}, 0)
phaseStatus.Set([]string{phaseNameCatchupReplication, phaseStatusCatchupReplicationStalled}, 0)
phaseStatus.Set([]string{phaseNameCatchupReplication, phaseStatusCatchupReplicationStopped}, 0)

// Re-enable redo logging.
if disabledRedoLog {
Expand All @@ -539,15 +579,18 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
if err := mysqld.Start(ctx, mycnf); err != nil {
return fmt.Errorf("Could not start MySQL after full shutdown: %v", err)
}
durationByPhase.Set("RestartBeforeBackup", int64(time.Since(restartAt).Seconds()))
deprecatedDurationByPhase.Set("RestartBeforeBackup", int64(time.Since(restartAt).Seconds()))
}

// Now we can take a new backup.
backupAt := time.Now()
phase.Set(phaseNameTakeNewBackup, int64(1))
defer phase.Set(phaseNameTakeNewBackup, int64(0))
if err := mysqlctl.Backup(ctx, backupParams); err != nil {
return fmt.Errorf("error taking backup: %v", err)
}
durationByPhase.Set("TakeNewBackup", int64(time.Since(backupAt).Seconds()))
deprecatedDurationByPhase.Set("TakeNewBackup", int64(time.Since(backupAt).Seconds()))
phase.Set(phaseNameTakeNewBackup, int64(0))

// Return a non-zero exit code if we didn't meet the replication position
// goal, even though we took a backup that pushes the high-water mark up.
Expand Down
1 change: 1 addition & 0 deletions go/flags/endtoend/vtbackup.txt
Original file line number Diff line number Diff line change
Expand Up @@ -134,6 +134,7 @@ Usage of vtbackup:
--mysql_server_version string MySQL server version to advertise. (default "8.0.30-Vitess")
--mysql_socket string path to the mysql socket
--mysql_timeout duration how long to wait for mysqld startup (default 5m0s)
--opentsdb_uri string URI of opentsdb /api/put method
--port int port for the server
--pprof strings enable profiling
--purge_logs_interval duration how often try to remove old logs (default 1h0m0s)
Expand Down
23 changes: 23 additions & 0 deletions go/stats/counters.go
Original file line number Diff line number Diff line change
Expand Up @@ -321,6 +321,29 @@ func (g *GaugesWithSingleLabel) Set(name string, value int64) {
g.counters.set(name, value)
}

// SyncGaugesWithSingleLabel is a GaugesWithSingleLabel that proactively pushes
// stats to push-based backends when Set is called.
type SyncGaugesWithSingleLabel struct {
GaugesWithSingleLabel
name string
}

// NewSyncGaugesWithSingleLabel creates a new SyncGaugesWithSingleLabel.
func NewSyncGaugesWithSingleLabel(name, help, label string, tags ...string) *SyncGaugesWithSingleLabel {
return &SyncGaugesWithSingleLabel{
GaugesWithSingleLabel: *NewGaugesWithSingleLabel(name, help, label, tags...),
name: name,
}
}

// Set sets the value of a named gauge.
func (sg *SyncGaugesWithSingleLabel) Set(name string, value int64) {
sg.GaugesWithSingleLabel.Set(name, value)
if sg.name != "" {
_ = pushOne(sg.name, &sg.GaugesWithSingleLabel)
}
}

// GaugesWithMultiLabels is a CountersWithMultiLabels implementation where
// the values can go up and down.
type GaugesWithMultiLabels struct {
Expand Down
28 changes: 20 additions & 8 deletions go/stats/export.go
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,22 @@ func Publish(name string, v expvar.Var) {
publish(name, v)
}

func pushAll() error {
backend, ok := pushBackends[statsBackend]
if !ok {
return fmt.Errorf("no PushBackend registered with name %s", statsBackend)
}
return backend.PushAll()
}

func pushOne(name string, v Variable) error {
backend, ok := pushBackends[statsBackend]
if !ok {
return fmt.Errorf("no PushBackend registered with name %s", statsBackend)
}
return backend.PushOne(name, v)
}

// StringMapFuncWithMultiLabels is a multidimensional string map publisher.
//
// Map keys are compound names made with joining multiple strings with '.',
Expand Down Expand Up @@ -183,8 +199,10 @@ func publish(name string, v expvar.Var) {
// to be pushed to it. It's used to support push-based metrics backends, as expvar
// by default only supports pull-based ones.
type PushBackend interface {
// PushAll pushes all stats from expvar to the backend
// PushAll pushes all stats from expvar to the backend.
PushAll() error
// PushOne pushes a single stat from expvar to the backend.
PushOne(name string, v Variable) error
}

var pushBackends = make(map[string]PushBackend)
Expand Down Expand Up @@ -214,13 +232,7 @@ func emitToBackend(emitPeriod *time.Duration) {
ticker := time.NewTicker(*emitPeriod)
defer ticker.Stop()
for range ticker.C {
backend, ok := pushBackends[statsBackend]
if !ok {
log.Errorf("No PushBackend registered with name %s", statsBackend)
return
}
err := backend.PushAll()
if err != nil {
if err := pushAll(); err != nil {
// TODO(aaijazi): This might cause log spam...
log.Warningf("Pushing stats to backend %v failed: %v", statsBackend, err)
}
Expand Down
Loading

0 comments on commit ebbd418

Please sign in to comment.