Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to auto-instrument with the OTel Java SDK #1704

Merged
merged 23 commits into from
Feb 28, 2025

Conversation

grcevski
Copy link
Contributor

The Hotspot JVM allows for dynamic loading of agents and this provides a way for auto-instrumentation of Java applications without the use of eBPF, but by loading the OpenTelemetry Java agent dynamically. This PR introduces an experimental support for using the OpenTelemetry agent instead of eBPF when instrumenting Java applications.

The way this is implemented is as follows:

  • Beyla does its normal scan for services to instrument based on the discovery criteria.
  • If a Java service is discovered (and the experimental OTel agent support is enabled), we enter the namespace of the Java service and then run a few jcmd commands.
  • First we check if the version of the JVM is supported. At the moment JDK 25 is not supported by OTel instrumentation, since Java 25 is still in early-access mode.
  • Next, we walk the classes loaded by the JVM and verify if it doesn't already load the OpenTelemetry agent.
  • Finally, we copy the OpenTelemetry agent to the temporary directory on the running container and dynamically load the agent, while passing a translation of the Beyla options to the OTLP Java agent options.

@grcevski grcevski requested a review from a team as a code owner February 26, 2025 19:28
Copy link

codecov bot commented Feb 26, 2025

Codecov Report

Attention: Patch coverage is 60.51661% with 107 lines in your changes missing coverage. Please review.

Project coverage is 69.68%. Comparing base (998ab56) to head (740f620).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/internal/otelsdk/sdk_inject.go 51.25% 80 Missing and 17 partials ⚠️
pkg/export/otel/common.go 60.00% 4 Missing and 4 partials ⚠️
pkg/export/otel/grafana.go 66.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1704      +/-   ##
==========================================
+ Coverage   65.29%   69.68%   +4.38%     
==========================================
  Files         196      210      +14     
  Lines       20086    21282    +1196     
==========================================
+ Hits        13116    14831    +1715     
+ Misses       6152     5721     -431     
+ Partials      818      730      -88     
Flag Coverage Δ
integration-test 54.47% <58.67%> (+0.05%) ⬆️
k8s-integration-test 53.93% <17.34%> (-0.62%) ⬇️
oats-test 34.79% <15.49%> (-0.29%) ⬇️
unittests 45.77% <14.39%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@rafaelroquetto rafaelroquetto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is legit great!
I left a bunch of nits/suggestions, feel free to ignore. I guess the only potential issue is the assumption of "." as the current working directory - let me know what you think

Comment on lines +76 to +81
sdkInstrumented := false
if ta.sdkInjectionPossible(&instr.Obj) {
if err := ta.sdkInjector.NewExecutable(&instr.Obj); err == nil {
sdkInstrumented = true
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a cosmetic nit - if you wrap this into a function, you can reword this as

 sdkInstrumented := ta.trySDKInjection(&instr.Obj)

Copy link
Contributor

@mariomac mariomac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's amazing!!!

I'd raise some concerns/questions, not to be addressed right now in this PR, but to think about in the future:

  • What happens if we stop Beyla? Is the agent unloaded or it keeps running and instrumenting until the service is restarted?
  • If Beyla is restarted with a new configuration (for example, a new OTEL endpoint or credentials), the new configuration should be applied in the target service. This might require a way to know when a service has been instrumented "manually" or when the instrumentation has been injected from Beyla.
  • Having the JAR file embedded in our repository might look a bit obscure for some users. I'd discuss the option of either compile it from the sources or, download the binary from a trusted environment.

But anyway, this is an amazing leap forward!

Dockerfile Outdated
@@ -39,6 +40,7 @@ COPY --from=builder /opt/app-root/bin/beyla .
COPY --from=builder /opt/app-root/LICENSE .
COPY --from=builder /opt/app-root/NOTICE .
COPY --from=builder /opt/app-root/third_party_licenses.csv .
COPY --from=builder /opt/app-root/opentelemetry-javaagent.jar .
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As something to improve in the future, we should find a way track the javaagent version and include it in our vulnerability scans.

"github.com/grafana/beyla/v2/pkg/internal/svc"
)

const OTelJDKAgent = "opentelemetry-javaagent.jar"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a future task, I'd set this value configurable so the user can provide its own version of the java-agent (e.g. a patched version, or an updated version to temporarily fix a discovered vulnerability).

@fstab
Copy link
Member

fstab commented Feb 27, 2025

not to be addressed right now in this PR, but to think about in the future

Actually, I think we should address the issues @mariomac brought up before we merge this PR. So far Beyla is low risk: If instrumentation causes issues, you can always stop Beyla and everything's back to normal. If that changes because attaching the JVM agent can't be undone, we need an idea what to do if the agent causes issues or if there is a misconfiguration (wrong OTLP endpoint, too high sampling rate, etc).

@grcevski
Copy link
Contributor Author

grcevski commented Feb 28, 2025

  • What happens if we stop Beyla? Is the agent unloaded or it keeps running and instrumenting until the service is restarted?

Good point, I think we can work on fixing this in follow-up PRs. I have some ideas, I think we should be able to supply set of options to make the agent effectively disabled.

  • If Beyla is restarted with a new configuration (for example, a new OTEL endpoint or credentials), the new configuration should be applied in the target service. This might require a way to know when a service has been instrumented "manually" or when the instrumentation has been injected from Beyla.

Great suggestion, I think we can do this. Right now we avoid double instrumenting, but we can do better than this. We should be able to tell if this agent was loaded on the command line or not, so to tell if it was managed by Beyla or not.

  • Having the JAR file embedded in our repository might look a bit obscure for some users. I'd discuss the option of either compile it from the sources or, download the binary from a trusted environment.

Rafael pushed changes to this PR and resolved this issue and the stuff that was bugging him :).

Copy link
Contributor

@rafaelroquetto rafaelroquetto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Groovy!

@grcevski
Copy link
Contributor Author

not to be addressed right now in this PR, but to think about in the future

Actually, I think we should address the issues @mariomac brought up before we merge this PR. So far Beyla is low risk: If instrumentation causes issues, you can always stop Beyla and everything's back to normal. If that changes because attaching the JVM agent can't be undone, we need an idea what to do if the agent causes issues or if there is a misconfiguration (wrong OTLP endpoint, too high sampling rate, etc).

We are not going to document this option for now, so that we can work on making it more solid. As you and Mario mentioned there are edge-cases we need to consider. But if we did it all at once it would be a much larger PR and complex. This way we can iterate on this until we are confident we can make this option suitable for customers to turn on.

@@ -59,4 +59,7 @@ type EBPFTracer struct {

// Enables debug printing of the protocol data
ProtocolDebug bool `yaml:"protocol_debug_print" env:"BEYLA_PROTOCOL_DEBUG_PRINT"`

// Enables Java instrumentation with the OpenTelemetry JDK Agent
UseOTelSDKForJava bool `yaml:"use_otel_sdk_for_java" env:"BEYLA_USE_OTEL_SDK_FOR_JAVA"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the code above, I assume that enabling this would only generate OTEL metrics or traces. I think we should validate that if Prometheus is enabled this option can't be enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this happens right now:
https://github.com/grafana/beyla/pull/1704/files#diff-d6000537195fb5cd7aa9e0b6bea705c332ce53d5db9e070e39dffcdc191562e3R65

With prometheus scrape Metrics will be false, so we'll program it to send traces only.

@grcevski grcevski merged commit c9291b2 into grafana:main Feb 28, 2025
12 checks passed
@grcevski grcevski deleted the docker_utils branch February 28, 2025 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants