Emit metrics related to how many tablets are owned by a CN #26

Merged 2 commits on Feb 18, 2025

Conversation

ctbrennan
Why I'm doing:

We want to emit these metrics to dashboards so we can be reasonably sure that tablets are well balanced across CNs and that no CN is too hot given the load assigned to it.

What I'm doing:

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

public static final MetricWithLabelGroup<LongCounterMetric> COUNTER_CN_SELECTED_FOR_TABLET_SCAN =
        new MetricWithLabelGroup<>("cn_id",
                () -> new LongCounterMetric("cn_selected_for_tablet_scan", MetricUnit.NOUNIT,
                        "times this FE selected the given CN for a scan of some tablet"));

Would the correct way to interpret/use this metric be something like:

(metric at t) - (metric at t-1) is how many times an FE selected a particular CN to scan any tablet in the past minute, assuming samples are taken one minute apart?

If (metric at t) - (metric at t-1) is greater for CN node 1 than for CN node 2, then node 1 is hotter than node 2, assuming other variables, such as total tablet size scanned, are held roughly the same.
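
To make that reading concrete, here is a minimal sketch of the delta computation, assuming two snapshots of the counter keyed by `cn_id`. `CnSelectionDelta` and its inputs are hypothetical and not part of this PR; in practice the dashboard's query language would likely do this. The counter is per FE, so a cluster-wide view would also need to sum the deltas across FEs.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not in this PR): per-CN increase of a cumulative counter between two samples. */
final class CnSelectionDelta {

    /** Both maps go from cn_id to the counter value observed at sample time. */
    static Map<String, Long> delta(Map<String, Long> previous, Map<String, Long> current) {
        Map<String, Long> result = new TreeMap<>(); // sorted by cn_id for stable output
        for (Map.Entry<String, Long> e : current.entrySet()) {
            long before = previous.getOrDefault(e.getKey(), 0L);
            long after = e.getValue();
            // If the counter went backwards, assume the FE restarted and the
            // counter began again from zero, so the new value is the increase.
            result.put(e.getKey(), after >= before ? after - before : after);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> atTMinus1 = Map.of("cn-1", 100L, "cn-2", 100L);
        Map<String, Long> atT = Map.of("cn-1", 160L, "cn-2", 120L);
        System.out.println(delta(atTMinus1, atT)); // prints {cn-1=60, cn-2=20}
    }
}
```

In this made-up sample, cn-1 was selected 60 times in the interval and cn-2 only 20 times, so cn-1 would be the hotter node if the scans were of comparable size.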

Author

Yes, exactly. You've hit on part of its utility: if this metric's value is roughly the same across all CNs but one CN has higher CPU utilization, that indicates some of its tablets are likely too big or receiving too much ingestion, or that the CN is doing some other work.
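
As a rough illustration of that check, the sketch below flags CNs whose selection count is close to the average while their CPU utilization sits well above it. Everything here is an assumption for illustration: `HotCnHeuristic`, the CPU input (which would come from a separate metric, not this PR), and the tolerance and margin values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical heuristic (not in this PR): evenly selected CNs that still run hot. */
final class HotCnHeuristic {

    /**
     * @param selections         cn_id to selection count over some interval
     * @param cpu                cn_id to CPU utilization in [0, 1] over the same interval
     * @param selectionTolerance allowed relative deviation from the mean selection count
     * @param cpuMargin          absolute CPU excess over the mean that counts as an outlier
     */
    static List<String> suspects(Map<String, Long> selections, Map<String, Double> cpu,
                                 double selectionTolerance, double cpuMargin) {
        double meanSel = selections.values().stream().mapToLong(Long::longValue).average().orElse(0.0);
        double meanCpu = cpu.values().stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        List<String> out = new ArrayList<>();
        // Iterate in cn_id order so the result is deterministic.
        for (Map.Entry<String, Long> e : new TreeMap<>(selections).entrySet()) {
            Double util = cpu.get(e.getKey());
            if (util == null || meanSel == 0.0) {
                continue; // no CPU sample for this CN, or no scans at all
            }
            boolean evenlySelected = Math.abs(e.getValue() - meanSel) <= selectionTolerance * meanSel;
            boolean cpuOutlier = util - meanCpu >= cpuMargin;
            if (evenlySelected && cpuOutlier) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> selections = Map.of("cn-1", 100L, "cn-2", 105L, "cn-3", 95L);
        Map<String, Double> cpu = Map.of("cn-1", 0.40, "cn-2", 0.42, "cn-3", 0.85);
        System.out.println(suspects(selections, cpu, 0.10, 0.20)); // prints [cn-3]
    }
}
```

A flagged CN is only a starting point for investigation: the heuristic cannot tell oversized tablets apart from heavy ingestion or unrelated work on the node.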

Cool, Thanks!

ctbrennan merged commit 80ef98e into pinterest-integration-3.3 on Feb 18, 2025
12 of 13 checks passed