Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

basic alerting, for internal use primarily #778

Closed
wants to merge 5 commits into from

Conversation

heavycrystal
Copy link
Contributor

@heavycrystal heavycrystal commented Dec 7, 2023

For now, an alerting function logs an entry in an alerts table in catalog, and also dispatches notifications to one or more slack channels. Events are staggered to be 15 minutes apart at least, as long as the same "alert key" is used.

Only logs alerts for slot lag increasing a threshold, configured via the environment variable PEERDB_SLOT_LAG_MB_ALERT_THRESHOLD

@@ -79,6 +81,11 @@ func NewPostgresConnector(ctx context.Context, pgConfig *protos.PostgresConfig)
metadataSchema = *pgConfig.MetadataSchema
}

vigil, err := evervigil.NewVigil()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should NewVigil take ctx given it's calling Context.Background()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought having it be context.Background() would allow alerts to be sent even in cases where context cancellation happens, not sure if it'll work out in practice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if you want that level of isolation you'll want the monitor to be a separate container

@heavycrystal heavycrystal force-pushed the alerting-v1-evervigil branch from 79da726 to 8b2c0f0 Compare December 7, 2023 20:11
@iskakaushik
Copy link
Contributor

@heavycrystal could you limit this PR to just implement the alerting service for slot growth. Lets split out the temporal activity failure alerts.

@heavycrystal heavycrystal force-pushed the alerting-v1-evervigil branch from 90a9d43 to 0ad9858 Compare December 8, 2023 20:57
@heavycrystal heavycrystal marked this pull request as ready for review December 11, 2023 13:15

if int64(slotInfo[0].LagInMb) >= slotLagInMBThreshold {
a.Vigil.AlertIf(fmt.Sprintf("%s-slot-size-exceeded", peerName),
fmt.Sprintf("Slot %s on peer %s has exceeded threshold size of %dMB, currently at %fMB!",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fmt.Sprintf("Slot %s on peer %s has exceeded threshold size of %dMB, currently at %fMB!",
fmt.Sprintf("Slot %s on peer %s has exceeded threshold size of %dMB, currently at %.2fMB!",

serprex added a commit that referenced this pull request Dec 12, 2023
Connection pools are meant to be shared, so remove pool from monitoring

Also go one step further: cache connection pool from env.go into global
This requires never closing pool returned by GetCatalogConnectionPoolFromEnv

Run migrations for CI because `monitoring` logic is now non optional

Opens up reusing pool for #778
flow/cmd/worker.go Outdated Show resolved Hide resolved
Comment on lines +77 to +79
if ev.catalogPool != nil {
ev.catalogPool.Close()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catalog pool should never be closed

@@ -43,9 +43,6 @@ type CDCPullConnector interface {
// PullFlowCleanup drops both the Postgres publication and replication slot, as a part of DROP MIRROR
PullFlowCleanup(jobName string) error

// SendWALHeartbeat allows for activity to progress restart_lsn on postgres.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this being removed because it's unused? Should be split out to another PR

@iskakaushik
Copy link
Contributor

please rename EverVigil to something like AlertingService or something more generic. I like the name, but it isn't obvious what it does.

@heavycrystal
Copy link
Contributor Author

reworked in #866

@serprex serprex deleted the alerting-v1-evervigil branch July 19, 2024 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants