Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Windows] Service startup failing on CI with otel dependencies linked #4976

Closed
michalpristas opened this issue Jun 21, 2024 · 7 comments · Fixed by #6068
Closed

[Windows] Service startup failing on CI with otel dependencies linked #4976

michalpristas opened this issue Jun 21, 2024 · 7 comments · Fixed by #6068
Assignees
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@michalpristas
Copy link
Contributor

michalpristas commented Jun 21, 2024

At the moment, Agent is unable to run in OTel mode on Windows — support for Windows was explicitly removed by adding build constraints as part of #4908. If we remove these constraints we see failures in CI like so:
failed to start service (Elastic Agent): The service did not respond to the start or control request in a timely fashion.

One of the reason may be late reaction to the control events from a service manager and invocation of service.ProcessWindowsControlEvents as described in #4971

Failures are present even if we move this to init function of alphabetically very early dependencies in main.
Timing here is also tricky, as dependencies get initialized before package linking them. so calling something in init does not mean it's being called first.

Event Log for elastic agent seems empty. Occasionally in linked draft above we see
24304 Jun 21 05:05 Error Elastic Agent 1 The description for Event ID '1' in Source 'Ela...
that could uncover some failures, but this is not present everywhere. But as i had this timeboxed i stopped here

https://techcommunity.microsoft.com/t5/iis-support-blog/the-description-for-event-id-1-from-source-getaccess-iis-runtime/ba-p/3744637

@michalpristas michalpristas added bug Something isn't working Team:Elastic-Agent Label for the Agent team labels Jun 21, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@michalpristas
Copy link
Contributor Author

michalpristas commented Jun 21, 2024

@leehinman after I spent all my timboxed time and talked to Julien and Pierre I'll let you to take over

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jun 21, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@leehinman
Copy link
Contributor

Note to self https://pkg.go.dev/runtime the inittrace=1 option should give insight into where time is being spent.

@ycombinator ycombinator changed the title Service startup failing on CI with otel dependencies linked [Windows] Service startup failing on CI with otel dependencies linked Jun 21, 2024
@leehinman
Copy link
Contributor

So I'm not entirely sure it is init that is causing the problem. I set GODEBUG=inittrace=1. And did elastic-agent run under Windows 2016.

Version before 4908

This took 406ms to get through init

tail -n 1 before_k8_4ccdd0.log
init github.com/elastic/elastic-agent/internal/pkg/agent/cmd @406 ms, 0 ms clock, 18208 bytes, 123 allocs

And one package took up over half that time, the k8s.io/api/extensions/v1beta1 package

awk '$5 > 0 {print $5, $2}' before_k8_4ccdd0.log | sort -rn
244 k8s.io/api/extensions/v1beta1
18 go.elastic.co/apm
12 github.com/aws/aws-sdk-go/aws/endpoints
10 github.com/jedib0t/go-pretty/v6/text
8.6 github.com/shirou/gopsutil/process
8.2 github.com/prometheus/client_golang/prometheus
8.0 k8s.io/client-go/tools/clientcmd
7.2 github.com/elastic/elastic-agent/internal/pkg/agent/application
5.3 github.com/shirou/gopsutil/cpu
4.1 go.elastic.co/apm/v2
4.0 k8s.io/client-go/kubernetes/scheme
3.8 github.com/elastic/elastic-agent-libs/match
2.0 go.elastic.co/apm/stacktrace
1.9 k8s.io/apimachinery/pkg/util/validation
1.9 github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl
1.9 github.com/hashicorp/go-version
1.7 internal/poll
1.0 k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json
1.0 gopkg.in/yaml%2ev3
1.0 go.opentelemetry.io/collector/service/internal/zpages
1.0 go.opencensus.io/trace/tracestate
1.0 google.golang.org/protobuf/reflect/protodesc
1.0 google.golang.org/grpc/codes
1.0 github.com/prometheus/common/expfmt
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor/internal/gcp
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor/internal/env
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/parser/container
1.0 github.com/google/go-cmp/cmp
1.0 github.com/elastic/go-sysinfo/providers/linux
1.0 github.com/elastic/go-elasticsearch/v7/estransport
1.0 github.com/elastic/elastic-agent/version
1.0 github.com/elastic/elastic-agent/internal/pkg/agent/transpiler
1.0 github.com/distribution/reference
1.0 crypto/tls
0.99 sigs.k8s.io/structured-merge-diff/v4/typed
0.99 k8s.io/client-go/tools/clientcmd/api/latest
0.90 runtime/metrics
0.71 github.com/shirou/gopsutil/v3/cpu
0.62 golang.org/x/net/trace
0.23 crypto/x509

Version with 4908 but remove the windows build contraints

This only took 206ms in init

tail -n 1 after_k8_with_windows.log
init github.com/elastic/elastic-agent/internal/pkg/agent/cmd @206 ms, 0 ms clock, 18208 bytes, 123 allocs

And this time it was github.com/leoluk/perflib_exporter/perflib that took up most of the time.

151 github.com/leoluk/perflib_exporter/perflib
9.0 go.elastic.co/apm
5.9 github.com/jedib0t/go-pretty/v6/text
3.9 k8s.io/client-go/kubernetes/scheme
2.9 github.com/aws/aws-sdk-go/aws/endpoints
1.1 go.elastic.co/apm/v2
1.0 time
1.0 k8s.io/klog/v2
1.0 k8s.io/client-go/tools/clientcmd
1.0 k8s.io/api/flowcontrol/v1beta3
1.0 k8s.io/api/core/v1
1.0 go.opencensus.io/resource
1.0 go.elastic.co/apm/v2/stacktrace
1.0 github.com/shirou/gopsutil/process
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor/internal/env
1.0 github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl
1.0 github.com/fatih/color
1.0 github.com/elastic/go-structform/gotype
1.0 github.com/elastic/elastic-agent/internal/pkg/agent/application/paths
1.0 github.com/distribution/reference
1.0 github.com/aws/aws-sdk-go/aws/request
1.0 github.com/alecthomas/participle/v2/lexer
0.99 k8s.io/apimachinery/pkg/runtime
0.99 internal/poll
0.99 google.golang.org/protobuf/internal/detrand
0.98 github.com/elastic/go-sysinfo/providers/linux
0.59 gopkg.in/yaml%2ev2

Version 4908 with windows constraints

This went through init much faster, only 48ms.

tail -n 1 after_k8_no_windows.log
init github.com/elastic/elastic-agent/internal/pkg/agent/cmd @48 ms, 0 ms clock, 18224 bytes, 123 allocs

With go.elastic.co/apm taking up most of the time

14 go.elastic.co/apm
4.9 github.com/jedib0t/go-pretty/v6/text
2.0 go.elastic.co/apm/v2
1.9 k8s.io/client-go/kubernetes/scheme
1.0 sigs.k8s.io/structured-merge-diff/v4/typed
1.0 k8s.io/api/policy/v1
1.0 k8s.io/api/autoscaling/v2
1.0 k8s.io/api/apidiscovery/v2beta1
1.0 internal/poll
1.0 github.com/elastic/go-sysinfo/providers/linux
1.0 github.com/elastic/elastic-agent-libs/logp
1.0 github.com/elastic/elastic-agent-client/v7/pkg/proto
1.0 github.com/distribution/reference
0.99 google.golang.org/protobuf/reflect/protodesc
0.99 github.com/kardianos/service
0.99 github.com/hectane/go-acl/api
0.99 github.com/elastic/go-structform/gotype
0.98 github.com/shirou/gopsutil/process

So #4908 actually made init faster than it was before (even without the build constraints) so it seems unlikely that increasing init time was the root cause.

@leehinman
Copy link
Contributor

Next directions to look at

golang/go#23479, specifically https://go-review.googlesource.com/c/sys/+/246317/3/windows/svc/sys_amd64.s and shirou/gopsutil#570. These seem to suggest that DLL initialization may be an issue for us.

@michalpristas
Copy link
Contributor Author

michalpristas commented Jul 10, 2024

this is a really nice finding. not sure what we can do about that at this point though. i don't think we can optimize for DLL load time, especially in dependencies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants