
[Observability AI Assistant] [Root Cause Analysis] Create an evaluation tool for LLM root cause analysis results #200064

Open
dominiqueclarke opened this issue Nov 13, 2024 · 3 comments
Assignees
Labels
Team:Obs AI Assistant Observability AI Assistant Team:obs-ux-management Observability Management User Experience Team

Comments

@dominiqueclarke
Contributor

dominiqueclarke commented Nov 13, 2024

As we begin to evaluate LLM-assisted root cause analysis, we need a way to evaluate the validity and usefulness of the results.

Historically, our process for evaluating these results has been quite manual. We've

  1. Created a failure scenario where the root cause is known
  2. Launched the LLM analysis while manually collecting all trace logs from the inference plugin
  3. Manually scanned the output for any mention of our known root cause

As we begin to test more scenarios against LLM-assisted root cause analysis, we'd like a more automated and sensible way to evaluate the results.

We'd like to evaluate both controlled (known architecture, known root cause) and uncontrolled (unknown architecture, unknown root cause) scenarios. Here's some of what we envision.

  1. Ability to generate a diagnostic file containing LLM output and available plugin logs.
    • Ideally, this diagnostic file could be downloaded via the UI, perhaps as part of the existing observability inspect framework, or something else. This would enable our partners (SREs, customers) to gather information we can use to help improve the product without hassle.
  2. Ability to specify what we believe the root cause to be. This could be provided by us (for controlled tests) or by our partners (customers, SREs, for uncontrolled tests).
    • Ideally, this could be specified at the time the diagnostic file is downloaded from the UI. That way, we could download a full diagnostic file that includes both the actual LLM responses and the specified expected root cause.
  3. Ability to evaluate how well we believe the output matches the expected root cause.
    • We are aware of this framework for evaluating various scenarios with the LLM integration, but it's unclear whether it can be adapted to fit our use case. We are intrigued by the idea of using AI to help us evaluate the output of the investigation, for example by providing the LLM with the final report and asking how well it reflects the known root cause. We'd be interested in a system that takes a given diagnostic file and reports how well it believes the LLM performed (a rough sketch of what that could look like follows below).
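
To make this concrete, here is a minimal sketch in TypeScript of what the diagnostic file and the LLM-as-judge step could look like. All names here (`RootCauseDiagnosticFile`, `evaluateDiagnosticFile`, `callJudgeModel`) are hypothetical and do not exist in Kibana today; they only illustrate the shape of the data and the judging step described above.

```ts
// Hypothetical shape of the diagnostic file and the LLM-as-judge evaluator.
// None of these names exist in Kibana; they only illustrate the idea.

interface RootCauseDiagnosticFile {
  investigationId: string;
  // Expected root cause, supplied by us (controlled tests) or by partners
  // such as SREs and customers (uncontrolled tests) when the file is downloaded.
  expectedRootCause: string;
  // Final report produced by the LLM-assisted root cause analysis.
  llmFinalReport: string;
  // Trace logs captured from the inference plugin during the analysis.
  inferencePluginLogs: string[];
}

interface RootCauseEvaluation {
  matchesExpectedRootCause: boolean;
  score: number; // 0–1: how well the report reflects the known root cause
  reasoning: string;
}

// `callJudgeModel` stands in for whatever inference client is available; the
// evaluator hands the judge the final report and the expected root cause and
// asks for a structured verdict.
async function evaluateDiagnosticFile(
  file: RootCauseDiagnosticFile,
  callJudgeModel: (prompt: string) => Promise<string>
): Promise<RootCauseEvaluation> {
  const prompt = [
    'You are grading an automated root cause analysis.',
    `Expected root cause: ${file.expectedRootCause}`,
    `Final report produced by the analysis: ${file.llmFinalReport}`,
    'Respond with JSON only, in the shape:',
    '{ "matchesExpectedRootCause": boolean, "score": number, "reasoning": string }',
  ].join('\n\n');

  const response = await callJudgeModel(prompt);
  return JSON.parse(response) as RootCauseEvaluation;
}
```

A small CLI wrapper over something like this could take a downloaded diagnostic file as input and print the verdict, which would cover both controlled and uncontrolled scenarios.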
@dominiqueclarke dominiqueclarke added Team:Obs AI Assistant Observability AI Assistant Team:obs-ux-management Observability Management User Experience Team labels Nov 13, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ai-assistant (Team:Obs AI Assistant)

@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@emma-raffenne
Contributor

hi @dominiqueclarke

I'd suggest having a look at the evaluation framework implemented by @almudenasanz that the Obs AI Assistant team is using: https://github.com/elastic/kibana/tree/main/x-pack/plugins/observability_solution/observability_ai_assistant_app/scripts/evaluation

It may make sense to use it as a base and extend it for your needs. Let us know what you think.
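
For illustration, a root-cause scenario written in that framework's style might look roughly like the sketch below. This is loosely modeled on the criteria-based pattern used there; `evaluate` and `chatClient` are placeholders, and the actual scenario API in the linked directory may differ, so check it before copying anything.

```ts
// Placeholder sketch only: `evaluate` and `chatClient` stand in for the
// scenario runner and chat client of the linked evaluation framework, whose
// actual API may differ. The service and root cause are invented test data.
import expect from '@kbn/expect';

evaluate.describe('root cause analysis', () => {
  evaluate('identifies the known root cause', async () => {
    // Drive the investigation the same way a user would in a controlled test.
    const conversation = await chatClient.complete(
      'Investigate the elevated error rate on the checkout service.'
    );

    // Ask an LLM judge to grade the result against explicit criteria,
    // including the known root cause for this controlled scenario.
    const result = await chatClient.evaluate(conversation, [
      'Identifies the expired TLS certificate on the payment gateway as the root cause',
      'Cites supporting evidence from logs or traces rather than speculating',
    ]);

    expect(result.passed).to.be(true);
  });
});
```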
