Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for exemplars in metrics UI #4629

Merged
merged 1 commit into from
Jul 12, 2024
Merged

Conversation

JamesNK
Copy link
Member

@JamesNK JamesNK commented Jun 24, 2024

Fixes #336
Fixes #4639

tldr example tracing data now shows up on metrics graphs as points. Click on a point to navigate to the trace UI. The $5 name for this feature is exemplars.

Done:

  • Captures exemplars from OTLP metrics data
  • Adds plotly trace bound to the tracing points
  • Mouse over a point shows a tooltip summary of the tracing data if available. It's possible the metrics data arrives before tracing data. If it's not available yet, then the trace ID is shown.
  • Clicking on the trace point navigates to the trace UI, displays the linked trace, and opens the span detail view.
  • Updates OTEL dependency to 1.9.0. Required for exemplar data from .NET

Todo:

  • Someone can click on an exemplar to view the trace detail before the trace or span is received by the dashboard. This primarily happens because OTEL includes exemplars with metrics immediately but doesn't send a span until it is finished. What should happen? One option is a loading indicator + message somewhere. Dialog is displayed on metrics page that data isn't ready. Navigates when data is ready, or can be canceled by user.
  • By default, opentelemetry-dotnet doesn't send exemplars with metrics for performance reasons. We can either enable this feature by updating the service defaults with new C# configuration or set an OTEL env var in Aspire. I think the OTEL env var should always be set in local development, and I'm undecided about defaulting to on in service defaults. It depends whether we think this should be on in production. Set env var in development

Demo:

exemplars

Microsoft Reviewers: Open in CodeFlow

@JamesNK
Copy link
Member Author

JamesNK commented Jun 24, 2024

@mitchdenny
Copy link
Member

When you have a lot of data points all on top of one another its going to get hard to select the right one. Are there any user interface patterns for solving this? I know in some CAD systems (which also have the problem of clusters of points/planes/etc overlapping) is to click-hold and get a context menu with all the nearby points of interest to then select.

@JamesNK
Copy link
Member Author

JamesNK commented Jun 24, 2024

I don't know if the chart gives that level of control. The idea of exemplars seems like they're a way to access an example of data. It seems inevitable that two trace points with almost the same value and almost the same time are on the graph. If you want to be able to access every trace, the trace view provides UI for that.

Alternatively, the metrics table view could be updated to provide links to all exemplars in a list. I haven't touch the metrics table view yet.

@samsp-msft
Copy link
Member

The point of exemplars is to provide a sampling of values, and not on every request. The OTel API has a way for specifying the sampler. We should probably setup this in ServiceDefaults to have a high sample rate during debug, and more sensible for production.

@reyang @cijothomas @CodeBlanch

@cijothomas
Copy link

The point of exemplars is to provide a sampling of values, and not on every request.

That is true. The default in OTel .NET SDK is to collect at most 1 Exemplar per collection for Non-Histograms, and at most N Exemplars per collection for Histograms, where N is the number of histogram buckets.

UI's can further filter the Exemplars to make UI less cluttered, esp. when multiple instances are reporting Exemplars..

@JamesNK
Copy link
Member Author

JamesNK commented Jun 24, 2024

What do you think about defaulting exemplars to on in production? Good or bad?

Docs say opentelemetry-dotnet defaults them to off, against recommended conventions, because of performance. Do you have more details to help us make a decision?

@reyang
Copy link

reyang commented Jun 24, 2024

What do you think about defaulting exemplars to on in production? Good or bad?

Having it on-by-default for histograms seems very useful (and the cost seems reasonable). For counters/gauges, I don't know how many users would want to use exemplars.

Docs say opentelemetry-dotnet defaults them to off, against recommended conventions, because of performance. Do you have more details to help us make a decision?

Check this open-telemetry/opentelemetry-specification#3870 (comment)

@CodeBlanch
Copy link

Having it on-by-default for histograms seems very useful (and the cost seems reasonable).

@JamesNK Just FYI in case you haven't noticed in .NET we have a special feature to enable exemplars just for histograms. Look for OTEL_DOTNET_EXPERIMENTAL_METRICS_EXEMPLAR_FILTER_HISTOGRAMS on here: https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics/customizing-the-sdk#exemplarfilter

@samsp-msft
Copy link
Member

What do you think about defaulting exemplars to on in production? Good or bad?

Having it on-by-default for histograms seems very useful (and the cost seems reasonable). For counters/gauges, I don't know how many users would want to use exemplars.

Docs say opentelemetry-dotnet defaults them to off, against recommended conventions, because of performance. Do you have more details to help us make a decision?

Check this open-telemetry/opentelemetry-specification#3870 (comment)

I'm wondering if we should be going with Trace-based exemplars, and adding something smarter for the trace sampling, which currently defaults to 100%. The overhead to metrics for the exemplar is large compared to typical aggregated metric data, but should be small in relation to the payload for each trace.

We probably have too high a default trace collection rate for production usage.

@kvenkatrajan
Copy link
Member

kvenkatrajan commented Jun 24, 2024

@JamesNK thanks for putting this together so quickly..looks great! Did some initial testing around this today and from a data perspective things look great. A few things we would need to review with the UX board on :

  • The size/color of the icon used for the examplar

  • @mitchdenny's comment above on discoverability

When you have a lot of data points all on top of one another its going to get hard to select the right one. Are there any user interface patterns for solving this?

CC: @leslierichardson95

Additionally from an accessbility perspective, how do we envision the data being displayed on the table view? Would the user need to filter by http.route to display the link to the span element? Please add an issue for tracking the display in table view.

@JamesNK
Copy link
Member Author

JamesNK commented Jun 24, 2024

Having it on-by-default for histograms seems very useful (and the cost seems reasonable).

@JamesNK Just FYI in case you haven't noticed in .NET we have a special feature to enable exemplars just for histograms. Look for OTEL_DOTNET_EXPERIMENTAL_METRICS_EXEMPLAR_FILTER_HISTOGRAMS on here: open-telemetry/opentelemetry-dotnet@main/docs/metrics/customizing-the-sdk#exemplarfilter

@CodeBlanch Optimizing for performance isn't a large concern in development. I think we can set OTEL_METRICS_EXEMPLAR_FILTER to trace_based based by default.

I'm wondering if we should be going with Trace-based exemplars, and adding something smarter for the trace sampling, which currently defaults to 100%. The overhead to metrics for the exemplar is large compared to typical aggregated metric data, but should be small in relation to the payload for each trace.

@samsp-msft Aspire sets trace sampling to off in development (i.e. collect all traces). After deployment, it uses whatever is the opentelemetry-dotnet default is for trace sampling.

@JamesNK
Copy link
Member Author

JamesNK commented Jun 24, 2024

UI's can further filter the Exemplars to make UI less cluttered, esp. when multiple instances are reporting Exemplars..

@cijothomas Thanks for the info! I think some UI filtering will be needed. For example, if someone sets the metrics chart timespan to one day, there could be a huge number of exemplars across that time. Thousands of points squeezed onto a graph won't be very useful.

Do you know what strategies they use? Options I can think of:

  • Have an even distribution of points across the graph (i.e. filter exemplars so a max of X is displayed for each graph tick)
  • Or have the graph support a maximum of X points and randomly sample the exemplars until that limit is hit. Could still end up with a cluster of points if most of them occured at the same time.

I think an even distribution would be more usable and work better in a graph updating in realtime.

@JamesNK
Copy link
Member Author

JamesNK commented Jun 25, 2024

@JamesNK thanks for putting this together so quickly..looks great! Did some initial testing around this today and from a data perspective things look great. A few things we would need to review with the UX board on :

  • The size/color of the icon used for the examplar
  • @mitchdenny's comment above on discoverability

When you have a lot of data points all on top of one another its going to get hard to select the right one. Are there any user interface patterns for solving this?

CC: @leslierichardson95

Additionally from an accessbility perspective, how do we envision the data being displayed on the table view? Would the user need to filter by http.route to display the link to the span element? Please add an issue for tracking the display in table view.

Issue for table view: #4639

Can you start UX review process started and CC me? I haven't done it before so I'd like to learn at the same time.

@JamesNK
Copy link
Member Author

JamesNK commented Jun 25, 2024

All remaining work is done. Please review.

@noahfalk
Copy link
Member

Its really cool to have exemplars included in there 👍

As someone experienced in the area I can make a good guess what the UI is showing me, but I am wondering if less experienced users will also understand it. This is my guess at a novice user thinking process:

  • I see dots on the graph, but what do these dots mean? Normally a dot in a line graph visually is placed on the line and it means the graph has a specific y-value at this x-value. If the dots land on the line I'd probably (incorrectly) guess it means that, but then why aren't the dots at each place the line changes slope? If the dots aren't on the line then what are they?
  • OK, I asked my coworker and they explained what exemplars are. If I hover over a dot it tells me the value, but is the popup telling me the value of this particular exemplar or the value of the time-series overall at this point?
  • The dots are all close together, if I click on one which one did I actually hit? If I fidget the mouse can click the others?

Asking some less experienced users is probably better than my own guesses though. If these guesses are right I'm no UI expert but I can offer one suggestion that might help:
image

Hovering over or clicking on the "Exemplars" link could give more info about exemplars. Hovering over each diamond would give info about all the exemplars available near that point in time. The hover info could be a small summary like "4 example traces" and then clicking on it goes to new page where each exemplar is a table row with the metric value and clickable links to each trace. Alternately if you can do a little table with links inside the popup then you might be able to show all the exemplars inline without needing an intermediate page.

What do you think about defaulting exemplars to on in production? Good or bad?

I would think its good as long as the sampling rate on the exemplars gives a total data volume that is a small fraction of what is being used for traces. @reyang @cijothomas, do the exemplar sampling controls let people produce that result even if it is just approximate?

@JamesNK
Copy link
Member Author

JamesNK commented Jun 25, 2024

I'm not a fan of displaying them just on the x axis. A visual representation of exemplar value height on the graph is useful to pick a high value exemplar.

The dots are all close together, if I click on one which one did I actually hit? If I fidget the mouse can click the others?

You will always go to the trace whose details are displayed in the popup.

I see dots on the graph, but what do these dots mean?

I updated the chart to show it in the legend:
image

To be honest I doubt many people know what an exemplar is, but I guess the legend gives them a name to call to the points on the graph.

If I hover over a dot it tells me the value, but is the popup telling me the value of this particular exemplar or the value of the time-series overall at this point?

The popup includes the span name. I hope people connect the dots and know that it represents a span. If they don't, clicking on a point (indicated by the pointer cursor on mouse over) takes you to the span, so they'll learn pretty quickly.

@noahfalk
Copy link
Member

I updated the chart to show it in the legend:

I think that definitely helps. I still hope we can have some grouping or filtering to avoid dots that are spatially too close together.

@reyang
Copy link

reyang commented Jun 25, 2024

I updated the chart to show it in the legend:

I think that definitely helps. I still hope we can have some grouping or filtering to avoid dots that are spatially too close together.

+1 on grouping - when the dots are getting too close, collapse them similar to what a power outage map would do, and allow user to drill down:

image

@adamint
Copy link
Member

adamint commented Jun 25, 2024

@JamesNK just a note that if we merge this without addressing #4639, there will be a sev2 a11y bug filed on us after the next accessibility review.

@samsp-msft
Copy link
Member

I'm not a fan of displaying them just on the x axis. A visual representation of exemplar value height on the graph is useful to pick a high value exemplar.

+1 - the value of exemplars is to be able to look at the outliers and see why they are different from the others. Showing them with a vertical position makes it much easier to understand where that exemplar lives in relation to the overall values.

I updated the chart to show it in the legend:

I think that definitely helps. I still hope we can have some grouping or filtering to avoid dots that are spatially too close together.

If there are too many close together, that probably says you have too many values. I am less concerned about being able to select an individual one on that basis, especially if we have the table view, as you then get the list you are after.

If there are a bunch close together, then we could take you to the table view - but how to take you to a specific set of entries and highlight them is a new UX affordance we don't yet have.

@noahfalk
Copy link
Member

Showing them with a vertical position makes it much easier to understand where that exemplar lives in relation to the overall values.

If a counter has a sudden spike, the exemplars presumably all still have value 1. It just happens that there were a lot of counter increments in that time period. This strategy works for histograms but not for all metric types.

I am less concerned about being able to select an individual one on that basis

I agree, I'm also not that concerned about selecting specific exemplars. I'm more concerned that we shouldn't show too many exemplar data points in the UI if they are overly difficult to select or if the large number of them causes too much visual clutter.

@JamesNK
Copy link
Member Author

JamesNK commented Jun 28, 2024

Please review 🙏

Copy link

@reyang reyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @JamesNK @samsp-msft would it make sense for two of you to show a demo to the OpenTelemetry community during the community/maintainers meeting? I think it will help to raise the awareness and also collect feedback from users.

@JamesNK
Copy link
Member Author

JamesNK commented Jun 29, 2024

What time is it? Might not work for me. Ping us details on teams/email.

@noahfalk
Copy link
Member

I think the table is very nice 👍 Is there anything that will prevent the graph from becoming cluttered with many exemplars if the app is running under sustained load? It seemed like if P50 stayed low but not zero it might get obscured behind lots of exemplar dots all concentrated near the x-axis.

@kvenkatrajan
Copy link
Member

kvenkatrajan commented Jun 29, 2024 via email

Copy link
Member

@mitchdenny mitchdenny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My approval is scoped to the app model side of things (very minor changes ... no concerns). Only thing I would suggest is regenerating the sample manifests under playground.

@JamesNK
Copy link
Member Author

JamesNK commented Jul 2, 2024

Is there anything that will prevent the graph from becoming cluttered with many exemplars if the app is running under sustained load?

I just added some sampling to the graph. It displays up to 20 points per tick. There are 30ish ticks which means approx 600 points max on a graph. The main purpose is to stop someone slamming an app with load and the graph trying to render thousands of points, killing browser performance.

It seemed like if P50 stayed low but not zero it might get obscured behind lots of exemplar dots all concentrated near the x-axis.

Nothing stops that. Graph view gives a visual overview and allows someone to see and click on data outliers. The table view is the place to look if you're interested in a particular span.

@samsp-msft
Copy link
Member

LGTM, @JamesNK @samsp-msft would it make sense for two of you to show a demo to the OpenTelemetry community during the community/maintainers meeting? I think it will help to raise the awareness and also collect feedback from users.

Yes, can do, when is it?

@timheuer
Copy link
Member

timheuer commented Jul 2, 2024

UI feedback: any way to padd numbers and/or align the visuals better? when icons are shown it's better they are more uniform aligned in table views rather than appear to shift position. Similarly (perhaps more nuanced because the number of exemplars changes) if the button for exemplars could be more uniform-width?

image

@JamesNK
Copy link
Member Author

JamesNK commented Jul 2, 2024

I can give the exemplar button a minimum width so an exemplar number between 1-999 has a consistent width. But at some point the button must grow to handle large numbers, so it won't always be consistent.

The icon placement is more difficult, and I won't touch in this PR. We tried lining them up in the past and depending on the data, it looked worse than the current layout.

Done:
image

@mitchdenny
Copy link
Member

You could right align the cells. That would make the icons line up at least.

@timheuer
Copy link
Member

timheuer commented Jul 3, 2024

The icon placement is more difficult

is the number always a 3 decimal value? e.g. instead of 0.01 make it 0.010 fixing/padding the decimal len? or move the icons to the other side?

@samsp-msft
Copy link
Member

@JamesNK can you add the changes to the ServiceDefaults project to enable exemplars, and setting the configuration to enable it just for histograms so it matches with the dashboard implementation.

@JamesNK
Copy link
Member Author

JamesNK commented Jul 10, 2024

@JamesNK can you add the changes to the ServiceDefaults project to enable exemplars, and setting the configuration to enable it just for histograms so it matches with the dashboard implementation.

Will do in a follow up PR - #4839

@JamesNK JamesNK force-pushed the jamesnk/metrics-exemplars branch from 0c7b0d5 to dfcd72a Compare July 11, 2024 01:18
@JamesNK
Copy link
Member Author

JamesNK commented Jul 12, 2024

Comparison of marker sizes.

Size 16 (current):

image

Size 14:

image

Size 12:

image

Size 10:

image

@JamesNK
Copy link
Member Author

JamesNK commented Jul 12, 2024

I think leaving it at 16 is good. For reference, 16 is the size of most icons throughout the app.
14 feels small but would work.
12 and 10 are too small.

@JamesNK
Copy link
Member Author

JamesNK commented Jul 12, 2024

I'm going to merge as is to unblock other work, and we can bike shed marker sizes and make the change in a follow up PR.

@JamesNK JamesNK merged commit de35684 into main Jul 12, 2024
9 checks passed
@JamesNK JamesNK deleted the jamesnk/metrics-exemplars branch July 12, 2024 01:44
@kvenkatrajan
Copy link
Member

kvenkatrajan commented Jul 12, 2024 via email

radical added a commit to radical/aspire that referenced this pull request Jul 12, 2024
commit 9071c84
Author: Ankit Jain <[email protected]>
Date:   Fri Jul 12 19:04:26 2024 -0400

    fix e2e

commit 68a7ae0
Merge: 09acf2d 681f2e7
Author: Ankit Jain <[email protected]>
Date:   Fri Jul 12 18:57:33 2024 -0400

    Merge remote-tracking branch 'origin/main' into tests-out-of-repo

commit 681f2e7
Author: Dan Moseley <[email protected]>
Date:   Fri Jul 12 14:04:57 2024 -0600

    Make credscan run clean (dotnet#4876)

    * suppressions

    * more

    * typo

commit a133411
Author: Dan Moseley <[email protected]>
Date:   Fri Jul 12 08:42:40 2024 -0600

    add TSA config (dotnet#4860)

    * tsaconfig

    * more

    * remove the config per Matt

    * add policheck in 1ES template

    * policheck exclusions

commit cf61a58
Author: James Newton-King <[email protected]>
Date:   Fri Jul 12 13:00:02 2024 +0800

    Load plotly as a module on metrics page (dotnet#4857)

commit 09acf2d
Author: Ankit Jain <[email protected]>
Date:   Fri Jul 12 00:34:04 2024 -0400

    fix archive path for e2e

commit da382ff
Merge: 18b92c3 de35684
Author: Ankit Jain <[email protected]>
Date:   Fri Jul 12 00:12:02 2024 -0400

    Merge remote-tracking branch 'origin/main' into tests-out-of-repo

commit 18b92c3
Author: Ankit Jain <[email protected]>
Date:   Fri Jul 12 00:07:52 2024 -0400

    merge from playground branch

commit de35684
Author: James Newton-King <[email protected]>
Date:   Fri Jul 12 09:43:58 2024 +0800

    Add support for exemplars in metrics UI (dotnet#4629)

commit e23cd4c
Merge: ef3158b 9421fc8
Author: Ankit Jain <[email protected]>
Date:   Thu Jul 11 18:00:20 2024 -0400

    Merge remote-tracking branch 'origin/main' into tests-out-of-repo

commit ef3158b
Author: Ankit Jain <[email protected]>
Date:   Thu Jul 11 03:27:52 2024 -0400

    wip

commit 7c61df1
Author: Ankit Jain <[email protected]>
Date:   Thu Jul 11 02:48:20 2024 -0400

    fixies mores

commit d6e21ab
Author: Ankit Jain <[email protected]>
Date:   Thu Jul 11 03:10:21 2024 -0400

    share targets

commit 01e419d
Merge: 0741157 421f8b1
Author: Ankit Jain <[email protected]>
Date:   Thu Jul 11 03:00:16 2024 -0400

    Merge remote-tracking branch 'origin/main' into tests-out-of-repo

    # Conflicts:
    #	tests/helix/send-to-helix-inner.proj

commit 0741157
Merge: 3406514 f647a66
Author: Ankit Jain <[email protected]>
Date:   Tue Jul 9 18:43:36 2024 -0400

    Merge remote-tracking branch 'origin/main' into tests-out-of-repo

commit 3406514
Author: Ankit Jain <[email protected]>
Date:   Mon Jul 8 21:18:23 2024 -0400

    fix build

commit 3489ff3
Author: Ankit Jain <[email protected]>
Date:   Mon Jul 8 20:23:10 2024 -0400

    fix props generation

commit 6a6315f
Author: Ankit Jain <[email protected]>
Date:   Mon Jul 8 20:14:53 2024 -0400

    fix

commit 92642ea
Author: Ankit Jain <[email protected]>
Date:   Mon Jul 8 19:36:17 2024 -0400

    rename

commit 4bea9f4
Author: Ankit Jain <[email protected]>
Date:   Mon Jul 8 19:21:36 2024 -0400

    cleanup

# Conflicts:
#	tests/Aspire.EndToEnd.Tests/Aspire.EndToEnd.Tests.csproj
#	tests/Shared/RepoTesting/Aspire.Testing.Repo.targets
@github-actions github-actions bot locked and limited conversation to collaborators Aug 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Display metrics exemplars in metrics table view Feature Idea: Connecting Metrics to Traces