
reinitalise targets when parsing host list #313

Merged
merged 2 commits into vtgateproxy from esme-vtgateproxy-reinit-targets on Apr 23, 2024

Conversation


@dedelala commented Apr 23, 2024

Description

Fix an issue with discovery where we're appending to the list of targets but never reinitialising it. Dead hosts are never removed from the list and active ones are added multiple times.

Change the metric types for active hosts from counters to gauges; we should only be tracking the current number, not incrementing.

I have tested this with a local build, and discovery works correctly when the host list changes.
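For context, a minimal sketch of the pattern being fixed, with hypothetical type and field names (the real parser lives in the vtgateproxy discovery code): rebuild the target list on every parse instead of appending to the previous one.

package discovery

// Hypothetical stand-ins for the real vtgateproxy structures.
type host struct{ addr string }
type target struct{ addr string }

type builder struct {
	targets []target
}

// rebuildTargets reinitialises the target list from the freshly parsed
// host list. The bug was appending to b.targets without ever resetting
// it, so dead hosts lingered and active hosts were added repeatedly.
func (b *builder) rebuildTargets(hosts []host) {
	targets := make([]target, 0, len(hosts)) // fresh slice each parse
	for _, h := range hosts {
		targets = append(targets, target{addr: h.addr})
	}
	b.targets = targets // replace, rather than extend, the old list
}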

Comment on lines 175 to 179
 	contentsChanged, err := b.parse()
-	if err != nil || !contentsChanged {
+	if err != nil {
+		log.Error(err)
+		continue
+	}
@dedelala (Author)

We should be logging these errors. I'm not sure whether we care about incrementing unchangedCount in this case.

@henryr

This is going to log once a second if there's a permanent error in the file on disk. Can we slow it down?

Collaborator

Yeah - I had some logging in an earlier iteration and removed it because I was worried about spam volume.

I think we should set a formatError=true bit the first time we detect a parse error and log only once, when we enter that state; then, when we successfully parse the file, clear the bit so the next error gets logged again.
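A sketch of that latch, assuming the polling loop and b.parse from the snippet above and Vitess's log package; formatError is the only new state:

import (
	"time"

	"vitess.io/vitess/go/vt/log"
)

// watch polls the host file and logs a parse error only on the
// transition into the error state, clearing the latch once the file
// parses cleanly again so the next failure is logged anew.
func (b *builder) watch(tick <-chan time.Time) {
	formatError := false
	for range tick {
		contentsChanged, err := b.parse()
		if err != nil {
			if !formatError {
				log.Error(err) // first failure of this episode
				formatError = true
			}
			continue
		}
		formatError = false // a clean parse re-arms the latch
		if !contentsChanged {
			continue
		}
		// ... rebuild targets from the parsed host list ...
	}
}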

@henryr left a comment

This is a great catch. Just a concern about logging volume.


-affinityCount.Add("local", local)
-affinityCount.Add("remote", remote)
-poolTypeCount.Add(r.poolType, int64(len(targets)))
+affinityCount.Set("local", local)
Collaborator

As-is this isn't quite accurate, since the set is already filtered by target by this point. We could move the

I realize counters aren't quite right either, but using gauges across the fleet in Prometheus isn't great for aggregations.

Collaborator

One thing we could do is pivot the model: instead of counting whenever we update a target, use something like stats.NewGaugesFuncWithMultiLabels so that we generate the stats on demand, regardless of whether any clients are connected.

I think that would be a better model frankly, since it would exercise all the parsing code and we could verify the discovery layer is running as we expect before sending any traffic through the system.

We could also wire in a /debug/status page so we can see the full state as well.
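A sketch of that pull-based shape, using Vitess's go/stats package (the metric name, labels, and the builder fields here are assumptions, not the current code):

import "vitess.io/vitess/go/stats"

// registerDiscoveryStats installs a gauge whose callback walks the
// current discovery state on every scrape, so the stats stay accurate
// even when no clients are connected.
func registerDiscoveryStats(b *builder) {
	stats.NewGaugesFuncWithMultiLabels(
		"DiscoveryTargets", // hypothetical metric name
		"current discovered targets by affinity and pool type",
		[]string{"Affinity", "PoolType"},
		func() map[string]int64 {
			b.mu.Lock() // assumes the builder guards its target list
			defer b.mu.Unlock()
			counts := make(map[string]int64)
			for _, t := range b.targets {
				// keys for multi-label stats join the label values with "."
				counts[t.affinity+"."+t.poolType]++
			}
			return counts
		},
	)
}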

@demmer (Collaborator) commented Apr 23, 2024

High level -- great find! I feel somewhat sheepish I did it this way in the first place.

My .02 is that we should separate out the fix for the issue from the other changes to the logging / metrics, merge and deploy the fix ASAP, and then work on the observability separately.

@demmer (Collaborator) left a comment

Strong +1 :)

@dedelala merged commit 11ccb3a into vtgateproxy on Apr 23, 2024
152 of 241 checks passed
@dedelala deleted the esme-vtgateproxy-reinit-targets branch on April 23, 2024 at 23:49
dedelala added a commit that referenced this pull request Jul 30, 2024
* reinitalise targets when parsing host list

* remove metrics and logging changes
dedelala added a commit that referenced this pull request Nov 12, 2024
* reinitalise targets when parsing host list

* remove metrics and logging changes
dedelala added a commit that referenced this pull request Jan 8, 2025
* reinitalise targets when parsing host list

* remove metrics and logging changes
dedelala added a commit that referenced this pull request Jan 29, 2025
* reinitalise targets when parsing host list

* remove metrics and logging changes