-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
basic alerting, for internal use primarily #778
Conversation
flow/connectors/postgres/postgres.go
Outdated
@@ -79,6 +81,11 @@ func NewPostgresConnector(ctx context.Context, pgConfig *protos.PostgresConfig) | |||
metadataSchema = *pgConfig.MetadataSchema | |||
} | |||
|
|||
vigil, err := evervigil.NewVigil() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should NewVigil
take ctx given it's calling Context.Background()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought having it be context.Background()
would allow alerts to be sent even in cases where context cancellation happens, not sure if it'll work out in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, if you want that level of isolation you'll want the monitor to be a separate container
79da726
to
8b2c0f0
Compare
@heavycrystal could you limit this PR to just implement the alerting service for slot growth. Lets split out the temporal activity failure alerts. |
90a9d43
to
0ad9858
Compare
|
||
if int64(slotInfo[0].LagInMb) >= slotLagInMBThreshold { | ||
a.Vigil.AlertIf(fmt.Sprintf("%s-slot-size-exceeded", peerName), | ||
fmt.Sprintf("Slot %s on peer %s has exceeded threshold size of %dMB, currently at %fMB!", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fmt.Sprintf("Slot %s on peer %s has exceeded threshold size of %dMB, currently at %fMB!", | |
fmt.Sprintf("Slot %s on peer %s has exceeded threshold size of %dMB, currently at %.2fMB!", |
Connection pools are meant to be shared, so remove pool from monitoring Also go one step further: cache connection pool from env.go into global This requires never closing pool returned by GetCatalogConnectionPoolFromEnv Run migrations for CI because `monitoring` logic is now non optional Opens up reusing pool for #778
523c80b
to
10dd323
Compare
10dd323
to
95bb209
Compare
if ev.catalogPool != nil { | ||
ev.catalogPool.Close() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catalog pool should never be closed
@@ -43,9 +43,6 @@ type CDCPullConnector interface { | |||
// PullFlowCleanup drops both the Postgres publication and replication slot, as a part of DROP MIRROR | |||
PullFlowCleanup(jobName string) error | |||
|
|||
// SendWALHeartbeat allows for activity to progress restart_lsn on postgres. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this being removed because it's unused? Should be split out to another PR
please rename EverVigil to something like |
reworked in #866 |
For now, an alerting function logs an entry in an alerts table in catalog, and also dispatches notifications to one or more slack channels. Events are staggered to be 15 minutes apart at least, as long as the same "alert key" is used.
Only logs alerts for slot lag increasing a threshold, configured via the environment variable
PEERDB_SLOT_LAG_MB_ALERT_THRESHOLD