
invalidate remote api keys on unenroll #3154

Merged: 11 commits into elastic:main on Dec 12, 2023

Conversation

@juliaElastic (Contributor) commented Dec 7, 2023:

What is the problem this PR solves?

When an agent using a remote ES output was unenrolled, its remote API key was not invalidated.

How does this PR solve the problem?

Enhanced the logic that invalidates API keys on unenroll to take the output into account, so that remote API keys are invalidated with the remote ES client (bulker).
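
A minimal sketch of that idea, not the actual fleet-server code: the retired-key items carry an optional output id (as the logs below show), the keys are grouped by output, and each group is invalidated with the client for that output. The bulkerFor lookup and the retireItem/invalidator types are assumptions introduced for illustration.

```go
package sketch

import (
	"context"
	"fmt"
)

// retireItem mirrors the shape of model.ToRetireAPIKeyIdsItems as it appears
// in the logs below: an API key id plus an optional output id (empty means
// the key belongs to the main cluster).
type retireItem struct {
	ID     string
	Output string
}

// invalidator stands in for the part of the bulker (ES client) that
// invalidates API keys.
type invalidator interface {
	APIKeyInvalidate(ctx context.Context, ids ...string) error
}

// invalidateByOutput groups the keys to retire by output, so that keys tagged
// with a remote output are invalidated with that output's client while the
// rest go to the default client.
func invalidateByOutput(ctx context.Context, keys []retireItem, bulkerFor func(output string) invalidator) error {
	grouped := map[string][]string{}
	for _, k := range keys {
		grouped[k.Output] = append(grouped[k.Output], k.ID)
	}
	for output, ids := range grouped {
		if err := bulkerFor(output).APIKeyInvalidate(ctx, ids...); err != nil {
			return fmt.Errorf("invalidate API keys for output %q: %w", output, err)
		}
	}
	return nil
}
```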

How to test this PR locally

Invalidate remote API key on agent unenroll

  1. Start ES, Kibana, and Fleet Server locally with these changes; start another cluster for monitoring on an 8.12-SNAPSHOT version.
  2. Add a remote ES output in the local cluster that points to the second cluster's ES.
  3. Create an agent policy with the remote output as its monitoring output.
  4. Enroll an agent into the agent policy and wait until the API key is created on the remote cluster and data is showing up.
  5. Unenroll the agent and wait for the agent to ack the action.
  6. Verify that the remote API key is invalidated (see the example check after the log excerpt below).

These logs show up in Fleet Server:

15:51:26.349 INF ack event actionId=0a53d3df-d527-40bc-864d-ac37bf019e71 agentId=240f75a2-f073-4853-a53c-eb839a915579 ecs.version=1.6.0 fleet.access.apikey.id=DRL5RIwB93ftHy-WdwfN fleet.agent.id=240f75a2-f073-4853-a53c-eb839a915579 http.request.id=01HH2FM23D038N5FVQSKGDRR5G n=0 nEvents=1 server.address= service.name=fleet-server service.type=fleet-server timestamp=2023-12-07T15:51:26.061Z
15:51:26.918 INF handleUnenroll invalidate API keys ecs.version=1.6.0 fleet.access.apikey.id=DRL5RIwB93ftHy-WdwfN fleet.agent.id=240f75a2-f073-4853-a53c-eb839a915579 fleet.policy.apiKeyIDsToRetire=[{"id":"DRL5RIwB93ftHy-WdwfN"},{"id":"DxL5RIwB93ftHy-WgwdB"},{"id":"yZv5RIwB4likW8KqgJ7g","output":"634bdfad-4bc5-4f9a-b7af-357049e0ab14"}] http.request.id=01HH2FM23D038N5FVQSKGDRR5G nEvents=1 server.address= service.name=fleet-server service.type=fleet-server
15:51:26.918 INF Invalidate old API keys ecs.version=1.6.0 fleet.access.apikey.id=DRL5RIwB93ftHy-WdwfN fleet.agent.id=240f75a2-f073-4853-a53c-eb839a915579 fleet.policy.apiKeyIDsToRetire=["DRL5RIwB93ftHy-WdwfN","DxL5RIwB93ftHy-WgwdB"] http.request.id=01HH2FM23D038N5FVQSKGDRR5G nEvents=1 server.address= service.name=fleet-server service.type=fleet-server
15:51:28.413 INF ack unenroll ecs.version=1.6.0 fleet.access.apikey.id=DRL5RIwB93ftHy-WdwfN fleet.agent.id=240f75a2-f073-4853-a53c-eb839a915579 http.request.id=01HH2FM23D038N5FVQSKGDRR5G nEvents=1 server.address= service.name=fleet-server service.type=fleet-server
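
One way to do the check in step 6 is to query the remote cluster's Get API key endpoint. This is a sketch, assuming a placeholder address and credentials for the remote cluster and using the remote key id from the log lines above; after the unenroll is acked the key should come back with "invalidated": true.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder address/credentials for the remote (monitoring) cluster.
	remoteES := "http://localhost:9201"
	user, pass := "elastic", "changeme"
	// Remote API key id taken from fleet.policy.apiKeyIDsToRetire above.
	keyID := "yZv5RIwB4likW8KqgJ7g"

	req, err := http.NewRequest(http.MethodGet, remoteES+"/_security/api_key?id="+keyID, nil)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth(user, pass)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// The response lists the key with an "invalidated" flag; after unenroll it
	// should read "invalidated": true.
	fmt.Println(string(body))
}
```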

Invalidate remote API key on agent force unenroll

  1. Follow steps 1-4 from the previous test.
  2. Force unenroll the agent; this results in Kibana invalidating the access API key in the main cluster and the agent being set to inactive (a sketch of this call follows the log lines below).
  3. Verify that Fleet Server invalidates the remote API keys on the next agent checkin; the agent can't access the remote ES anymore.

Logs in Fleet Server:

10:02:28.274 INF agent record inactive error.message="agent inactive" ecs.version=1.6.0 fleet.access.apikey.id=G6R5XYwBwag5VSQ2PICs fleet.agent.id=862a126a-6688-40a0-9d8e-42d52efee813 http.request.id=01HHEQMNSFYVW5HWTT60TB5NG2 server.address= service.name=fleet-server service.type=fleet-server
10:02:28.274 INF handleCheckin invalidate remote API keys ecs.version=1.6.0 fleet.agent.id=862a126a-6688-40a0-9d8e-42d52efee813 fleet.policy.apiKeyIDsToRetire=[{"id":"mTh5XYwBeBiY1xB6SEjA","output":"61c5de86-7c49-41f0-acd7-0018c2f6d6bb"}] http.request.id=01HHEQMNSFYVW5HWTT60TB5NG2 server.address= service.name=fleet-server service.type=fleet-server
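
For step 2, the force unenroll can be triggered through Kibana's Fleet API. This is a sketch, assuming a local Kibana with placeholder credentials and using the agent id from the logs above; the revoke flag is what makes the unenroll a force unenroll.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Placeholder Kibana address/credentials; agent id from the logs above.
	kibana := "http://localhost:5601"
	agentID := "862a126a-6688-40a0-9d8e-42d52efee813"

	// Assumption: Fleet's unenroll endpoint with revoke=true performs a force
	// unenroll (access API key revoked immediately, agent set to inactive).
	body := strings.NewReader(`{"revoke": true}`)
	req, err := http.NewRequest(http.MethodPost, kibana+"/api/fleet/agents/"+agentID+"/unenroll", body)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("kbn-xsrf", "true") // required by Kibana for non-GET API calls
	req.SetBasicAuth("elastic", "changeme")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```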

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Closes https://github.com/elastic/ingest-dev/issues/2739

@juliaElastic added the bug (Something isn't working) label Dec 7, 2023
@juliaElastic self-assigned this Dec 7, 2023
}
apiKeys := agent.APIKeyIDs()
zlog.Info().Any("fleet.policy.apiKeyIDsToRetire", apiKeys).Msg("handleUnenroll invalidate API keys")
ack.invalidateAPIKeys(ctx, zlog, apiKeys, "")
juliaElastic (author) commented on this diff:
Building []ToRetireAPIKeyIdsItems here to reuse the logic in invalidateAPIKeys for invalidating against the remote ES.
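
For reference, the slice built here would look roughly like the following for the agent in the earlier unenroll log excerpt (reusing the illustrative retireItem shape from the sketch above; the real field names on model.ToRetireAPIKeyIdsItems are assumed from the JSON in the log):

```go
// Illustrative only; mirrors fleet.policy.apiKeyIDsToRetire from the log above.
apiKeys := []retireItem{
	{ID: "DRL5RIwB93ftHy-WdwfN"}, // the agent's access API key (matches fleet.access.apikey.id)
	{ID: "DxL5RIwB93ftHy-WgwdB"}, // another main-cluster key
	{ID: "yZv5RIwB4likW8KqgJ7g", Output: "634bdfad-4bc5-4f9a-b7af-357049e0ab14"}, // remote output key
}
```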

@juliaElastic marked this pull request as ready for review December 11, 2023 13:40
@juliaElastic requested a review from a team as a code owner December 11, 2023 13:40
@juliaElastic (author) commented:

Do we require a changelog entry if this PR is a fix for a feature in the same minor (8.12)?

@juliaElastic (author) commented Dec 11, 2023:

Fleet Server e2e tests are failing; any ideas how to fix this?


ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref 38822cc1-fce7-4a14-8000-191906c319e9::m513aysusmxwk51w2txeqipsv: failed to walk /var/lib/docker/tmp/buildkit-mount2809367210/cover/fleet-server-8.12.0-SNAPSHOT-linux-x86_64: lstat /var/lib/docker/tmp/buildkit-mount2809367210/cover/fleet-server-8.12.0-SNAPSHOT-linux-x86_64: no such file or directory
--
  | make: *** [Makefile:348: build-e2e-agent-image] Error 1
  | 🚨 Error: The command exited with status 2

Tried to use ES version 8.13-SNAPSHOT in the e2e tests, but it looks like there is no such version yet: https://artifacts-api.elastic.co/v1/versions/

@juliaElastic (author) commented:

/test

@michel-laterman (Contributor) left a comment:

lgtm, let's get the tests passing and merge

@juliaElastic (author) commented:

/test

@@ -740,3 +694,49 @@ func makeUpdatePolicyBody(policyID string, newRev, coordIdx int64) []byte {

return buf.Bytes()
}

func invalidateAPIKeys(ctx context.Context, zlog zerolog.Logger, bulk bulk.Bulk, toRetireAPIKeyIDs []model.ToRetireAPIKeyIdsItems, skip string) {
juliaElastic (author) commented on this diff:
Moved this function outside of AckT so it is accessible from handleCheckin too.

@@ -119,6 +119,11 @@ func (ct *CheckinT) handleCheckin(zlog zerolog.Logger, w http.ResponseWriter, r

agent, err := authAgent(r, &id, ct.bulker, ct.cache)
if err != nil {
// invalidate remote API keys of force unenrolled agents
if err == ErrAgentInactive && agent != nil {
juliaElastic (author) commented on this diff:
Added testing steps for the force unenroll scenario to the PR description.

Decided to invalidate remote API keys on checkin here, when finding that the agent is inactive, because the agent becomes inactive after force unenroll. It's not guaranteed that the agent will check in after force unenroll, but we can't do the invalidation in Kibana, as Kibana doesn't have access to the remote ES service token (it is a secret). Fleet Server also doesn't handle force unenroll actions, and even if it did, it only reads actions on agent checkin, so I didn't find a better way to invalidate API keys after force unenroll.
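
A rough sketch of how that could look inside handleCheckin, building on the diff context above; field names on model.ToRetireAPIKeyIdsItems are assumed, and invalidateAPIKeys (the package-level function shown earlier) is assumed to resolve the remote ES client for each item's output.

```go
agent, err := authAgent(r, &id, ct.bulker, ct.cache)
if err != nil {
	// Sketch: a force-unenrolled agent is already inactive, so authAgent fails
	// with ErrAgentInactive but still returns the agent record.
	if err == ErrAgentInactive && agent != nil {
		// Keep only the keys tagged with a remote output; the main-cluster keys
		// were already invalidated by Kibana on force unenroll.
		keys := agent.APIKeyIDs()
		remote := make([]model.ToRetireAPIKeyIdsItems, 0, len(keys))
		for _, k := range keys {
			if k.Output != "" { // field name assumed from the "output" key in the logs
				remote = append(remote, k)
			}
		}
		if len(remote) > 0 {
			zlog.Info().Any("fleet.policy.apiKeyIDsToRetire", remote).
				Msg("handleCheckin invalidate remote API keys")
			invalidateAPIKeys(r.Context(), zlog, ct.bulker, remote, "")
		}
	}
	return err
}
```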

@juliaElastic (author) commented:

@michel-laterman thanks for the review! Added a fix for force unenroll too in this commit: 35f3539

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
73.8% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

@juliaElastic merged commit d4413c5 into elastic:main Dec 12, 2023
8 checks passed
@juliaElastic added the backport-v8.12.0 (Automated backport with mergify) label Dec 12, 2023
juliaElastic added a commit to juliaElastic/fleet-server that referenced this pull request Dec 12, 2023
* invalidate remote api keys on unenroll

* fix test

* using latest es snapshot in tests

* try with latest 8.12 snapshot

* use new 8.13 snapshot

* try without build id

* invalidate remote api key after force unenroll

* fix lint

* added integration test to force unenroll

* added integration test on unenroll

* fix lint
juliaElastic added a commit that referenced this pull request Dec 13, 2023
* invalidate remote api keys on unenroll (#3154)

* invalidate remote api keys on unenroll

* fix test

* using latest es snapshot in tests

* try with latest 8.12 snapshot

* use new 8.13 snapshot

* try without build id

* invalidate remote api key after force unenroll

* fix lint

* added integration test to force unenroll

* added integration test on unenroll

* fix lint

* use latest snapshot

* try with 8.12-snapshot

* change back version in version.go
@@ -1,4 +1,4 @@
-ELASTICSEARCH_VERSION=8.12.0-9d443b17-SNAPSHOT
+ELASTICSEARCH_VERSION=8.13.0-SNAPSHOT
A member commented on this diff:

This change broke the automation with updatecli, see https://github.com/elastic/fleet-server/actions/runs/7197121542

is that the expected behaviour?

juliaElastic (author) replied:

Sorry, I changed it to fix the e2e tests; it was not working with a specific 8.13 build.

juliaElastic (author):

how can we fix this?


juliaElastic (author):

Do you mind raising PRs to fix this, or do you want me to?

mergify bot pushed a commit that referenced this pull request Dec 13, 2023
* invalidate remote api keys on unenroll

* fix test

* using latest es snapshot in tests

* try with latest 8.12 snapshot

* use new 8.13 snapshot

* try without build id

* invalidate remote api key after force unenroll

* fix lint

* added integration test to force unenroll

* added integration test on unenroll

* fix lint

(cherry picked from commit d4413c5)

# Conflicts:
#	dev-tools/integration/.env
This was referenced Dec 13, 2023
Labels: backport-v8.12.0 (Automated backport with mergify), bug (Something isn't working)
3 participants