Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agent gets unhealthy on assigning from policy with Elastic Defend integration to without Defend integration. #3617

Closed
amolnater-qasource opened this issue Oct 17, 2023 · 13 comments · Fixed by #3747
Assignees
Labels
bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. QA:Validated Validated by the QA Team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@amolnater-qasource
Copy link

Kibana Build details:

VERSION: 8.11.0 BC2
BUILD: 67841
COMMIT: 636a8339cfad92998a5a5adb7be81e3546525ebf

Host OS: All

Preconditions:

  1. 8.11.0 BC2 Kibana cloud environment should be available.
  2. Windows agent should be installed using System and Elastic Defend integration.

Steps to reproduce:

  1. Navigate to Fleet>Agents tab.
  2. Assign the agent to the policy without Elastic Defend integration.
  3. Observe agent gets unhealthy.

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-10-17.14-45-43.mp4
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-10-17.14-47-06.mp4

Agent Logs:
elastic-agent-diagnostics-2023-10-17T09-17-18Z-00.zip

Agent JSON:
ec2amaz-auvpg7n-agent-details.zip

Expected Result:
Agent shouldn't get unhealthy on assigning from policy with Elastic Defend integration to without Defend integration.

Impacted Testcase:
https://elastic.testrail.io/index.php?/tests/view/2166951

Similar issue:
#3507

@amolnater-qasource amolnater-qasource added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team impact:high Short-term priority; add to current release, or definitely next. labels Oct 17, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@amolnater-qasource
Copy link
Author

@manishgupta-qasource Please review.

@manishgupta-qasource
Copy link

Secondary review for this ticket is Done

@cmacknz
Copy link
Member

cmacknz commented Oct 17, 2023

"components": [
    {
      "id": "log-default",
      "type": "log",
      "status": "HEALTHY",
      "message": "Healthy: communicating with pid '128'",
      "units": [
        {
          "id": "log-default",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "winlog-default",
      "type": "winlog",
      "status": "DEGRADED",
      "message": "Degraded: pid '4900' missed 1 check-in",
      "units": [
        {
          "id": "winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "winlog-default",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "system/metrics-default",
      "type": "system/metrics",
      "status": "HEALTHY",
      "message": "Healthy: communicating with pid '916'",
      "units": [
        {
          "id": "system/metrics-default",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "filestream-monitoring",
      "type": "filestream",
      "status": "DEGRADED",
      "message": "Degraded: pid '7040' missed 1 check-in",
      "units": [
        {
          "id": "filestream-monitoring-filestream-monitoring-agent",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "filestream-monitoring",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "beat/metrics-monitoring",
      "type": "beat/metrics",
      "status": "HEALTHY",
      "message": "Healthy: communicating with pid '1552'",
      "units": [
        {
          "id": "beat/metrics-monitoring-metrics-monitoring-beats",
          "type": "input",
          "status": "CONFIGURING",
          "message": "Configuring"
        },
        {
          "id": "beat/metrics-monitoring",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "http/metrics-monitoring",
      "type": "http/metrics",
      "status": "DEGRADED",
      "message": "Degraded: pid '6044' missed 1 check-in",
      "units": [
        {
          "id": "http/metrics-monitoring-metrics-monitoring-agent",
          "type": "input",
          "status": "CONFIGURING",
          "message": "Configuring"
        },
        {
          "id": "http/metrics-monitoring",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    }
  ],

The diagnostics shows everything is healthy but the snapshot of the agent details while this was happening shows that multiple processes are missing check ins.

@cmacknz
Copy link
Member

cmacknz commented Oct 17, 2023

Usually missed check ins happen due to increased CPU usage on the machine. The policy reassignment would have caused every input in the policy to stop and restart, and the missed checkins correlate with that event in the logs:

"log.level":"debug","@timestamp":"2023-10-17T09:16:42.412Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":192},"message":"context is done. exiting.","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPED->STOPPING): Stopping","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPING","old_state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed filestream-monitoring-filestream-monitoring-agent (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring-filestream-monitoring-agent","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed filestream-monitoring (HEALTHY->DEGRADED): Degraded: pid '7040' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed http/metrics-monitoring (HEALTHY->DEGRADED): Degraded: pid '6044' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed winlog-default (HEALTHY->DEGRADED): Degraded: pid '4900' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}

@cmacknz
Copy link
Member

cmacknz commented Oct 17, 2023

Does this happen every time? If it does happen every time can it be reproduced on the latest 8.10.4 release? Are you able to check or record the CPU usage of the machine running the agent when the policy reassignment is happening?

@amolnater-qasource
Copy link
Author

Hi @cmacknz

Thank you for looking into this.

We have validated this issue on 8.10.4 released production environment and found it not reproducible there.
We have also revalidated this issue on 8.11.0 BC3 kibana cloud environment and found it consistently reproducible there.

Further observations:

  • CPU consumption reaches 100% on reassigning the policy on both the kibana versions- 8.10.4 and 8.11.0.
  • However, only 8.11.0 agent gets unhealthy on reassigning and 8.10.4 agent remains Unhealthy.

Screen Recordings:
8.10.4:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-10-19.10-28-25.mp4
Amol.Windows.-.ec2-18-212-196-2.compute-1.amazonaws.com.-.Remote.Desktop.Connection.2023-10-19.10-28-57.mp4

8.11.0:

Amol.Windows.-.ec2-18-212-196-2.compute-1.amazonaws.com.-.Remote.Desktop.Connection.2023-10-19.10-46-29.mp4
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-10-19.10-47-19.mp4

Please let us know if anything else is required from our end.

Thanks!!

@cmacknz
Copy link
Member

cmacknz commented Oct 23, 2023

When the CPU reaches 100%, can you run elastic-agent diagnostics --cpu-profile and attach the results here?

@amolnater-qasource
Copy link
Author

Hi @cmacknz

We have reproduced this issue and recollected the new logs when CPU reached 100% with the shared command.

Logs:
elastic-agent-diagnostics-2023-10-25T11-34-54Z-00.zip

Please let us know if anything else is required from our end.

Thanks!!

@cmacknz
Copy link
Member

cmacknz commented Oct 25, 2023

The latest diagnostics isn't giving me any clues but likely this is the same problem as #3654

@cmacknz
Copy link
Member

cmacknz commented Oct 26, 2023

Pulling this into the sprint and assigning to Lee since this is the same as #3654 but we'll need to retest this scenario to confirm.

@amolnater-qasource
Copy link
Author

Hi Team,

We have revalidated this issue on latest 8.12.0 BC5 kibana cloud environment and found it fixed now.

Observations:

  • Agent remains healthy on assigning from policy with Elastic Defend integration to without Defend integration.
  • Issue is also not reproducible if the agent tamper protections are also enabled.

Build details:
VERSION: 8.12.0 BC5
BUILD: 70053
COMMIT: db9b8921b37139cbb1e11d23f6381f655edeb72b
Artifact Link: https://staging.elastic.co/8.12.0-9f05a310/summary-8.12.0.html

Screenshot:
image

Hence we are marking this issue as QA:Validated.

Thanks!

@amolnater-qasource amolnater-qasource added QA:Validated Validated by the QA Team and removed QA:Ready For Testing Code is merged and ready for QA to validate labels Jan 11, 2024
@harshitgupta-qasource
Copy link

Bug Conversion:

We have updated 01 testcase for this scenario in our fleet test suite at:

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. QA:Validated Validated by the QA Team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants