Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Windows] TestProxyURL fails with access denied error on fleet.enc #4913

Open
pchila opened this issue Jun 12, 2024 · 11 comments
Open

[Windows] TestProxyURL fails with access denied error on fleet.enc #4913

pchila opened this issue Jun 12, 2024 · 11 comments
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@pchila
Copy link
Member

pchila commented Jun 12, 2024

While validating PR #4770 TestProxyURL test fails with

{"log.level":"error","@timestamp":"2024-06-11T17:55:59.378Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":161},"message":"Failed to dispatch action id \"actionID-TestValidProxyInThePolicy\" of type \"POLICY_CHANGE\", error: saving config: fail to persist new Fleet Server API client hosts: could not replace target file C:\\Program Files\\Elastic\\Agent\\fleet.enc: rename C:\\Program Files\\Elastic\\Agent\\fleet.enc.tmp C:\\Program Files\\Elastic\\Agent\\fleet.enc: Access is denied.","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

CI run -> https://buildkite.com/elastic/elastic-agent-extended-testing/builds/538#01900846-16ad-428c-8a60-5a48324d55f5

As a workaround the test now install elastic-agent as privileged on Windows (with this the test runs correctly) but it seems that there are some issues with fleet.enc permissions in unprivileged mode not directly related to the developed enhancement.

This issue is there to track this issue and collect results of further investigation,

@pchila pchila added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent Label for the Agent team labels Jun 12, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@pchila
Copy link
Member Author

pchila commented Jun 12, 2024

Did some additional runs on a Windows VM... It seems that the issue is intermittent which suggests some kind of race condition that make the test fail sometimes on windows unprivileged. In the specific instance I managed to reproduce only once over 3 runs of the specific test on a Win10 VM

@cmacknz
Copy link
Member

cmacknz commented Jun 12, 2024

That this error has only ever been seen in #4770, and that PR also changes the policy change logic, makes me suspect it could be related to something in #4770. It is also possible you've added test coverage to something that was always flaky, and now we can observe it.

Access is denied is what you get if you try to remove a file while something else has it open. I don't think it necessarily has to do with the privilege level of the agent.

@blakerouse
Copy link
Contributor

Yeah I agree that its possible this is coming from this PR. I just looked at main to try and determine how I can reproduce this and I see that this test is skipped https://github.com/elastic/elastic-agent/blob/main/testing/integration/proxy_url_test.go#L35.

Happy to help work with you on this error and why you think its specifically related to unprivileged mode and permissions on Windows.

@pchila
Copy link
Member Author

pchila commented Jun 13, 2024

After further development done in #4770 the proxy url tests are enabled again (those were disabled due to #4861)

In #4770 I changed a log level in the agent to see what error may cause an action dispatch failure https://github.com/elastic/elastic-agent/pull/4770/files#diff-b1931bdf74ee7a85f4291c623d9a21868c438c9ef9829e8e939f2b43906b3535R159 (before it was a simple debug log which was not printed most of the time), so that may have revealed an error that we already have before the changes...

The code changes for mTLS did not add any extra reading/writing of fleet.enc so if there is a race condition, it's likely that we already have it in main...

I will keep investigating trying to figure out what triggers the access denied error

@pchila pchila changed the title [Windows unprivileged] TestProxyURL fails with access denied error on fleet.enc [Windows] TestProxyURL fails with access denied error on fleet.enc Jun 13, 2024
@pchila
Copy link
Member Author

pchila commented Jun 17, 2024

@pchila
Copy link
Member Author

pchila commented Jun 17, 2024

I am going to disable the ProxyURL tests on windows until the access denied is resolved

@pchila
Copy link
Member Author

pchila commented Jun 27, 2024

After a few tries we got a CI run that contains useful information about the issue: https://buildkite.com/elastic/elastic-agent-extended-testing/builds/810#01903606-b9b9-4977-bc0d-b2327f19b1f1

Grepping fleet.enc from the logs contained in the diagnostics we can see

➜  TestProxyURL-EnrollWithProxy-PolicyProxyTakesPrecedence-diagnostics-2024-06-20T14-59-38Z grep -ir fleet.enc ./* | sort
<omitted...>
{"file.name":"storage/encrypted_disk_store.go","file.line":129},"message":"Save of C:\\Program Files\\Elastic\\Agent\\fleet.enc started at 2024-06-20 14:54:17.257021 +0000 GMT m=+1.117881501\n","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.282Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_storage_debug_windows.go","file.line":41},"message":"C:\\Program Files\\Elastic\\Agent\\fleet.enc security descriptor: O:S-1-5-21-310561555-502049046-1354876720-1009G:S-1-5-21-310561555-502049046-1354876720-513","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.282Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_storage_debug_windows.go","file.line":61},"message":"C:\\Program Files\\Elastic\\Agent\\fleet.enc stat:\n{\n  \"FileAttributes\": 32,\n  \"CreationTime\": {\n    \"LowDateTime\": 3101609211,\n    \"HighDateTime\": 31114017\n  },\n  \"LastAccessTime\": {\n    \"LowDateTime\": 3101784772,\n    \"HighDateTime\": 31114017\n  },\n  \"LastWriteTime\": {\n    \"LowDateTime\": 3101784772,\n    \"HighDateTime\": 31114017\n  },\n  \"FileSizeHigh\": 0,\n  \"FileSizeLow\": 274,\n  \"ReparseTag\": 0\n}\n","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.288Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_storage_debug_windows.go","file.line":53},"message":"owner for \"C:\\\\Program Files\\\\Elastic\\\\Agent\\\\fleet.enc\": OGC-WINDOWS-AMD\\elastic-agent-user account type 1","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.288Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_store.go","file.line":131},"message":"Save of C:\\Program Files\\Elastic\\Agent\\fleet.enc finished at 2024-06-20 14:54:17.2883767 +0000 GMT m=+1.149237201\n","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-3.ndjson:{"log.level":"error","@timestamp":"2024-06-20T14:54:35.407Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":161},"message":"Failed to dispatch action id \"actionID-TestValidProxyInThePolicy\" of type \"POLICY_CHANGE\", error: saving config: fail to persist new Fleet Server API client hosts: could not replace target file C:\\Program Files\\Elastic\\Agent\\fleet.enc: rename C:\\Program Files\\Elastic\\Agent\\fleet.enc.tmp C:\\Program Files\\Elastic\\Agent\\fleet.enc: Access is denied.","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

<omitted...>

Along with the logs from the test report

[ ===] Waiting For Enroll...  [4s] {"log.level":"warn","@timestamp":"2024-06-20T14:54:16.230Z","log.logger":"tls","log.origin":{"file.name":"tlscommon/tls_config.go","file.line":107},"message":"SSL/TLS verifications disabled.","ecs.version":"1.6.0"}
        {"log.level":"info","@timestamp":"2024-06-20T14:54:16.783Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":518},"message":"Starting enrollment to URL: http://fleet.elastic.co/","ecs.version":"1.6.0"}
        Load of C:\Program Files\Elastic\Agent\fleet.enc started at 2024-06-20 14:54:16.9998057 +0000 GMT m=+0.860666201Load of C:\Program Files\Elastic\Agent\fleet.enc finished at 2024-06-20 14:54:17.0038126 +0000 GMT m=+0.864673101{"log.level":"info","@timestamp":"2024-06-20T14:54:17.674Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":481},"message":"Restarting agent daemon, attempt 0","ecs.version":"1.6.0"}
        {"log.level":"info","@timestamp":"2024-06-20T14:54:17.683Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":299},"message":"Successfully triggered restart on running Elastic Agent.","ecs.version":"1.6.0"}
        Successfully enrolled the Elastic Agent.

If we look at the timestamp when the enroll command is restarting agent after writing fleet.enc 2024-06-20T14:54:17.674Z and the logs just before logging the access denied claiming that saving of fleet.enc completed at 2024-06-20 14:54:17.2883767 +0000 GMT we can assume that there's some interaction between writing from enroll and saving the very first policy data.

This appears in the tests for 2 reasons:

  • the mock fleet server is a local httptest.Server so no network delay is introduced
  • the mock fleet server send out immediately thhe policy change action (unlike a real fleet server)

On windows having such a short time where 2 processes try to modify the same file seems to cause the access denied error, maybe this can be mitigated by making the mock fleet server slow down a few milliseconds before sending out the action or having it resend on the next checkin if it's not been acknowledged.

@leehinman
Copy link
Contributor

oh joy.

In both cases I think we are using the file.SafeFileRotate to create a temp file and then rename it to the destination. On Windows the Rename ends up being a call to https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-movefileexw with MOVEFILE_REPLACE_EXISTING option. We probably need to make sure that file.SafeFileRotate doesn't have a race condition when called in rapid succession.

@blakerouse
Copy link
Contributor

We discuss this issue in the weekly meeting and I wanted to document it here. I believe the following change will result in this no longer having a race condition and overall will provide a better experience for installation.

  1. Write the installation directory (but do not install the service)
  2. Perform enrollment inside of the installation directory (but do not tell it to restart the running daemon)
  3. Install the service
  4. Start the service

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

No branches or pull requests

6 participants