Add List of Metrics for Windows + Design #166

Merged: 3 commits, Feb 21, 2024
3 changes: 3 additions & 0 deletions receiver/awscontainerinsightreceiver/README.md
@@ -851,6 +851,9 @@ kubectl apply -f config.yaml

The attribute `container_status_reason` is present only when `container_status` is in the "Waiting" or "Terminated" state. The attribute `container_last_termination_reason` is present only when `container_status` is in the "Terminated" state.

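As a rough illustration of these presence rules, the Go sketch below builds the status attributes for a single container. The `containerState` type and the `statusAttributes` function are hypothetical; they only mirror the rules stated above and are not the receiver's actual code.

```go
package main

import "fmt"

// containerState is a hypothetical, simplified container state holder.
type containerState struct {
	Status                string // "Running", "Waiting", or "Terminated"
	Reason                string // reason reported for the current state
	LastTerminationReason string // reason for the most recent termination
}

// statusAttributes returns the status-related resource attributes for one
// container, following the presence rules described in the README.
func statusAttributes(s containerState) map[string]string {
	attrs := map[string]string{"container_status": s.Status}
	switch s.Status {
	case "Waiting":
		// Waiting containers carry a reason but no last termination reason.
		attrs["container_status_reason"] = s.Reason
	case "Terminated":
		// Terminated containers carry both attributes.
		attrs["container_status_reason"] = s.Reason
		attrs["container_last_termination_reason"] = s.LastTerminationReason
	}
	// Running containers keep only container_status.
	return attrs
}

func main() {
	for _, s := range []containerState{
		{Status: "Running"},
		{Status: "Waiting", Reason: "CrashLoopBackOff"},
		{Status: "Terminated", Reason: "Error", LastTerminationReason: "OOMKilled"},
	} {
		fmt.Println(statusAttributes(s))
	}
}
```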
## Available Metrics and Resource Attributes on Windows
Refer to [Metrics on Windows](./internal/k8swindows/README.md).

This is a sample configuration for AWS Container Insights using the `awscontainerinsightreceiver` and `awsemfexporter` for an ECS cluster to collect the instance-level metrics:
```
receivers:
```
11 changes: 10 additions & 1 deletion receiver/awscontainerinsightreceiver/design.md
@@ -3,6 +3,7 @@

## Container Insights Architecture for EKS
![architecture](images/eks-design.png)
![architecture for Windows Nodes](images/eks-windows-design.png)

## Container Insights Architecture for ECS
![architecture](images/ecs-design.png)
@@ -16,7 +17,15 @@
  * Some pod/container-related labels, such as podName, podId, namespace, and containerName, are extracted from the container spec provided by `cadvisor`. These labels are added to the metrics as resource attributes, which the AWS Container Insights processor needs for further processing.
* `k8sapiserver`
  * Collects cluster-level metrics from the Kubernetes API server.
  * The receiver is designed to run as a DaemonSet, which guarantees that only one receiver runs per cluster node. To avoid duplicating cluster-level metrics, the receiver integrates with the K8s client, which supports the leader election API, and uses a Kubernetes ConfigMap as a lock primitive. The deployment creates a dedicated ConfigMap as the lock resource. When a leader needs to be elected, each receiver tries to acquire the lock by creating or updating (Create/Update) the ConfigMap, and the API ensures that exactly one receiver holds the lock and becomes the leader. The leader continually sends “heartbeats” to keep its leadership, while the other candidates periodically retry to become the leader. As a result, a new leader is elected quickly if the current leader fails for any reason.

For Windows worker nodes, `awscontainerinsightreceiver` collects data from two main sources:
* `kubelet` Summary API
  * The kubelet on a Windows node exposes the Summary API, which returns CPU, memory, network, and storage metrics for containers, pods, and the node.
  * The receiver generates Container Insights-specific metrics from the raw metrics provided by `kubelet`. The metrics are categorized by infrastructure layer: node, node filesystem, node network, pod, pod network, container, and container filesystem.
* HCS Shim API
  * The HCS (Host Compute Service) shim API provides network metrics for containers.

The following two packages are used to decorate metrics:

253 changes: 253 additions & 0 deletions receiver/awscontainerinsightreceiver/internal/k8swindows/README.md
@@ -0,0 +1,253 @@
## Available Metrics and Resource Attributes for Windows Worker Nodes

### Node
| Metric | Unit |
|-------------------------------------------|--------------|
| node_cpu_limit | Millicore |
| node_cpu_request | Millicore |
| node_cpu_reserved_capacity | Percent |
| node_cpu_usage_total | Millicore |
| node_cpu_utilization | Percent |
| node_memory_limit | Bytes |
| node_memory_pgfault | Count/Second |
| node_memory_pgmajfault | Count/Second |
| node_memory_request | Bytes |
| node_memory_reserved_capacity | Percent |
| node_memory_rss | Bytes |
| node_memory_usage | Bytes |
| node_memory_utilization | Percent |
| node_memory_working_set | Bytes |
| node_network_rx_bytes | Bytes/Second |
| node_network_rx_dropped | Count/Second |
| node_network_rx_errors | Count/Second |
| node_network_total_bytes | Bytes/Second |
| node_network_tx_bytes | Bytes/Second |
| node_network_tx_dropped | Count/Second |
| node_network_tx_errors | Count/Second |
| node_number_of_running_containers | Count |
| node_number_of_running_pods | Count |
| node_status_condition_ready | Count |
| node_status_condition_pid_pressure | Count |
| node_status_condition_memory_pressure | Count |
| node_status_condition_disk_pressure | Count |
| node_status_condition_network_unavailable | Count |
| node_status_condition_unknown | Count |
| node_status_capacity_pods | Count |
| node_status_allocatable_pods | Count |

<br/><br/>
| Resource Attribute |
|-----------------------|
| ClusterName |
| InstanceType |
| NodeName |
| Timestamp |
| Type |
| Version |
| Sources |
| kubernetes |
| OperatingSystem |

<br/><br/>
<br/><br/>

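The utilization and reserved-capacity metrics above are ratios. Below is a minimal Go sketch, assuming `node_cpu_utilization` is usage divided by `node_cpu_limit` (taken to be the node's CPU capacity) and that the `*_reserved_capacity` metrics are the summed pod requests divided by the same limit. The `percent` helper and the node sizes are hypothetical.

```go
package main

import "fmt"

// percent returns part/whole*100. It reports ok=false when whole is zero so
// the caller can skip the metric instead of dividing by zero.
func percent(part, whole float64) (float64, bool) {
	if whole == 0 {
		return 0, false
	}
	return part / whole * 100, true
}

func main() {
	// Hypothetical node with 4 vCPU and 16 GiB of memory.
	const (
		cpuLimit   = 4000.0   // node_cpu_limit, millicores
		cpuUsage   = 1000.0   // node_cpu_usage_total, millicores
		cpuRequest = 2000.0   // node_cpu_request (assumed: sum of pod requests), millicores
		memLimit   = 16 << 30 // node_memory_limit, bytes
		memRequest = 4 << 30  // node_memory_request (assumed: sum of pod requests), bytes
	)

	if v, ok := percent(cpuUsage, cpuLimit); ok {
		fmt.Printf("node_cpu_utilization=%.1f%%\n", v)
	}
	if v, ok := percent(cpuRequest, cpuLimit); ok {
		fmt.Printf("node_cpu_reserved_capacity=%.1f%%\n", v)
	}
	if v, ok := percent(memRequest, memLimit); ok {
		fmt.Printf("node_memory_reserved_capacity=%.1f%%\n", v)
	}
}
```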
### Node Filesystem
| Metric | Unit |
|------------------------------|---------|
| node_filesystem_available | Bytes |
| node_filesystem_capacity | Bytes |
| node_filesystem_usage | Bytes |
| node_filesystem_utilization | Percent |

<br/><br/>
| Resource Attribute |
|---------------------- |
| AutoScalingGroupName |
| ClusterName |
| InstanceId |
| InstanceType |
| NodeName |
| Timestamp |
| Type |
| Version |
| Sources |
| kubernetes           |
| OperatingSystem      |

<br/><br/>
<br/><br/>

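The node filesystem metrics map naturally onto the byte counters in the Summary API's `node.fs` object. The sketch below assumes `node_filesystem_utilization` is `usedBytes / capacityBytes * 100`; `fsStats`, `fsMetrics`, and the sample values are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fsStats mirrors a subset of the filesystem stats in the kubelet Summary API
// (node.fs); only the byte counters are decoded here.
type fsStats struct {
	AvailableBytes uint64 `json:"availableBytes"`
	CapacityBytes  uint64 `json:"capacityBytes"`
	UsedBytes      uint64 `json:"usedBytes"`
}

// fsMetrics derives the node filesystem metrics. It returns ok=false when the
// capacity is unknown (zero), because utilization would be undefined.
func fsMetrics(fs fsStats) (map[string]float64, bool) {
	if fs.CapacityBytes == 0 {
		return nil, false
	}
	return map[string]float64{
		"node_filesystem_available":   float64(fs.AvailableBytes),
		"node_filesystem_capacity":    float64(fs.CapacityBytes),
		"node_filesystem_usage":       float64(fs.UsedBytes),
		"node_filesystem_utilization": float64(fs.UsedBytes) / float64(fs.CapacityBytes) * 100,
	}, true
}

func main() {
	// Hypothetical node.fs fragment: a 100 GiB volume with 30 GiB used.
	raw := `{"availableBytes":75161927680,"capacityBytes":107374182400,"usedBytes":32212254720}`
	var fs fsStats
	if err := json.Unmarshal([]byte(raw), &fs); err != nil {
		panic(err)
	}
	m, ok := fsMetrics(fs)
	if !ok {
		fmt.Println("capacity unknown; skipping filesystem metrics")
		return
	}
	for _, name := range []string{"node_filesystem_available", "node_filesystem_capacity",
		"node_filesystem_usage", "node_filesystem_utilization"} {
		fmt.Printf("%s=%.1f\n", name, m[name])
	}
}
```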
### Node Network
| Metric | Unit |
|------------------------------------|--------------|
| node_interface_network_rx_bytes | Bytes/Second |
| node_interface_network_rx_dropped | Count/Second |
| node_interface_network_rx_errors | Count/Second |
| node_interface_network_total_bytes | Bytes/Second |
| node_interface_network_tx_bytes | Bytes/Second |
| node_interface_network_tx_dropped | Count/Second |
| node_interface_network_tx_errors | Count/Second |

<br/><br/>
| Resource Attribute |
|-----------------------|
| AutoScalingGroupName |
| ClusterName |
| InstanceId |
| InstanceType |
| NodeName |
| Timestamp |
| Type |
| Version |
| interface |
| Sources |
| kubernetes            |
| OperatingSystem       |

<br/><br/>
<br/><br/>

### Pod
| Metric | Unit |
|-------------------------------------------------------------------|--------------|
| pod_cpu_limit | Millicore |
| pod_cpu_request | Millicore |
| pod_cpu_reserved_capacity | Percent |
| pod_cpu_usage_total | Millicore |
| pod_cpu_utilization | Percent |
| pod_cpu_utilization_over_pod_limit | Percent |
| pod_memory_limit | Bytes |
| pod_memory_max_usage | Bytes |
| pod_memory_pgfault | Count/Second |
| pod_memory_pgmajfault | Count/Second |
| pod_memory_request | Bytes |
| pod_memory_reserved_capacity | Percent |
| pod_memory_rss | Bytes |
| pod_memory_usage | Bytes |
| pod_memory_utilization | Percent |
| pod_memory_utilization_over_pod_limit | Percent |
| pod_memory_working_set | Bytes |
| pod_network_rx_bytes | Bytes/Second |
| pod_network_rx_dropped | Count/Second |
| pod_network_rx_errors | Count/Second |
| pod_network_total_bytes | Bytes/Second |
| pod_network_tx_bytes | Bytes/Second |
| pod_network_tx_dropped | Count/Second |
| pod_network_tx_errors | Count/Second |
| pod_number_of_container_restarts | Count |
| pod_number_of_containers | Count |
| pod_number_of_running_containers | Count |
| pod_status_ready | Count |
| pod_status_scheduled | Count |
| pod_status_unknown | Count |
| pod_status_failed | Count |
| pod_status_pending | Count |
| pod_status_running | Count |
| pod_status_succeeded | Count |
| pod_container_status_running | Count |
| pod_container_status_terminated | Count |
| pod_container_status_waiting | Count |
| pod_container_status_waiting_reason_crash_loop_back_off | Count |
| pod_container_status_waiting_reason_image_pull_error | Count |
| pod_container_status_waiting_reason_start_error | Count |
| pod_container_status_waiting_reason_create_container_error | Count |
| pod_container_status_waiting_reason_create_container_config_error | Count |
| pod_container_status_terminated_reason_oom_killed | Count |

| Resource Attribute |
|------------------------|
| AutoScalingGroupName |
| ClusterName |
| InstanceId |
| InstanceType |
| K8sPodName |
| Namespace |
| NodeName |
| PodId |
| Timestamp |
| Type |
| Version |
| Sources |
| kubernetes             |
| pod_status |
| OperatingSystem |

<br/><br/>

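Most of the pod status metrics are 0/1 flags or per-state counts. The Go sketch below derives a subset of them from a pod phase and trimmed-down container statuses. It assumes `pod_status_<phase>` is 1 only for the current phase and covers just the `CrashLoopBackOff` and `OOMKilled` reasons; `pod_status_ready` and `pod_status_scheduled`, which would come from pod conditions rather than the phase, are left out. All type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// containerStatus is a hypothetical, trimmed-down container status. State is
// "Running", "Waiting", or "Terminated"; Reason is the Kubernetes reason
// string for that state, such as "CrashLoopBackOff" or "OOMKilled".
type containerStatus struct {
	State  string
	Reason string
}

// podStatusMetrics derives a subset of the pod status metrics from the pod
// phase and its container statuses.
func podStatusMetrics(phase string, containers []containerStatus) map[string]int {
	m := map[string]int{}
	// pod_status_<phase>: 1 for the current phase, 0 for every other phase.
	for _, p := range []string{"Pending", "Running", "Succeeded", "Failed", "Unknown"} {
		name := "pod_status_" + strings.ToLower(p)
		if p == phase {
			m[name] = 1
		} else {
			m[name] = 0
		}
	}
	// Initialize counters so every metric is emitted, even when it is 0.
	for _, s := range []string{"running", "waiting", "terminated"} {
		m["pod_container_status_"+s] = 0
	}
	m["pod_container_status_waiting_reason_crash_loop_back_off"] = 0
	m["pod_container_status_terminated_reason_oom_killed"] = 0

	for _, c := range containers {
		m["pod_container_status_"+strings.ToLower(c.State)]++
		switch {
		case c.State == "Waiting" && c.Reason == "CrashLoopBackOff":
			m["pod_container_status_waiting_reason_crash_loop_back_off"]++
		case c.State == "Terminated" && c.Reason == "OOMKilled":
			m["pod_container_status_terminated_reason_oom_killed"]++
		}
	}
	m["pod_number_of_containers"] = len(containers)
	m["pod_number_of_running_containers"] = m["pod_container_status_running"]
	return m
}

func main() {
	m := podStatusMetrics("Running", []containerStatus{
		{State: "Running"},
		{State: "Waiting", Reason: "CrashLoopBackOff"},
	})
	fmt.Println(m) // fmt prints map keys in sorted order
}
```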
### Pod Network
| Metric | Unit |
|-----------------------------------|--------------|
| pod_interface_network_rx_bytes | Bytes/Second |
| pod_interface_network_rx_dropped | Count/Second |
| pod_interface_network_rx_errors | Count/Second |
| pod_interface_network_total_bytes | Bytes/Second |
| pod_interface_network_tx_bytes | Bytes/Second |
| pod_interface_network_tx_dropped | Count/Second |
| pod_interface_network_tx_errors | Count/Second |


<br/><br/>
| Resource Attribute |
|----------------------|
| AutoScalingGroupName |
| ClusterName |
| InstanceId |
| InstanceType |
| K8sPodName |
| Namespace |
| NodeName |
| PodId |
| Timestamp |
| Type |
| Version |
| interface |
| Sources |
| kubernetes           |
| pod_status |
| OperatingSystem |

<br/><br/>
<br/><br/>


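The pod network table carries an `interface` attribute, while the pod table has pod-wide `pod_network_*` metrics. One plausible relationship, assumed here rather than confirmed by this document, is that the pod-level values are sums over the pod's interfaces. `ifaceRate`, `podNetworkTotals`, and the interface names are hypothetical.

```go
package main

import "fmt"

// ifaceRate is a hypothetical per-interface rate for one pod, for example
// derived from per-endpoint counters such as those the HCS shim reports.
type ifaceRate struct {
	Interface string
	RxBytes   float64 // bytes/second
	TxBytes   float64 // bytes/second
}

// podNetworkTotals sums per-interface rates into pod-level rates. In this
// sketch the per-interface values map to pod_interface_network_* (with an
// "interface" attribute) and the sums map to pod_network_*.
func podNetworkTotals(ifaces []ifaceRate) (rx, tx float64) {
	for _, i := range ifaces {
		rx += i.RxBytes
		tx += i.TxBytes
	}
	return rx, tx
}

func main() {
	ifaces := []ifaceRate{
		{Interface: "eth0", RxBytes: 1200, TxBytes: 400},
		{Interface: "eth1", RxBytes: 300, TxBytes: 100},
	}
	for _, i := range ifaces {
		fmt.Printf("pod_interface_network_rx_bytes{interface=%q}=%.0f\n", i.Interface, i.RxBytes)
	}
	rx, tx := podNetworkTotals(ifaces)
	fmt.Printf("pod_network_rx_bytes=%.0f pod_network_tx_bytes=%.0f pod_network_total_bytes=%.0f\n",
		rx, tx, rx+tx)
}
```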
### Container
| Metric | Unit |
|---------------------------------------------------|--------------|
| container_cpu_limit | Millicore |
| container_cpu_request | Millicore |
| container_cpu_usage_total | Millicore |
| container_cpu_utilization | Percent |
| container_cpu_utilization_over_container_limit | Percent |
| container_memory_limit | Bytes |
| container_memory_mapped_file | Bytes |
| container_memory_pgfault | Count/Second |
| container_memory_pgmajfault | Count/Second |
| container_memory_request | Bytes |
| container_memory_rss | Bytes |
| container_memory_usage | Bytes |
| container_memory_utilization | Percent |
| container_memory_utilization_over_container_limit | Percent |
| container_memory_working_set | Bytes |
| number_of_container_restarts | Count |

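`container_cpu_limit` and `container_memory_limit` are reported in millicores and bytes, while pod specs express limits as Kubernetes quantities such as `500m` or `512Mi`. The Go sketch below converts a small subset of that quantity syntax (not the full `resource.Quantity` grammar) and computes the over-limit percentages, assuming memory usage is measured by working set. The helper names and values are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUMillicores converts a CPU quantity such as "500m" or "2" into
// millicores. Only these two forms are handled.
func parseCPUMillicores(q string) (float64, error) {
	if strings.HasSuffix(q, "m") {
		return strconv.ParseFloat(strings.TrimSuffix(q, "m"), 64)
	}
	cores, err := strconv.ParseFloat(q, 64)
	return cores * 1000, err
}

// parseMemoryBytes converts a memory quantity with an optional binary suffix
// (Ki, Mi, Gi) into bytes. Decimal suffixes such as "M" or "G" are not handled.
func parseMemoryBytes(q string) (float64, error) {
	suffixes := []struct {
		suffix string
		factor float64
	}{{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}}
	for _, s := range suffixes {
		if strings.HasSuffix(q, s.suffix) {
			v, err := strconv.ParseFloat(strings.TrimSuffix(q, s.suffix), 64)
			return v * s.factor, err
		}
	}
	return strconv.ParseFloat(q, 64)
}

func main() {
	// Hypothetical limits from a pod spec.
	cpuLimit, err := parseCPUMillicores("500m")
	if err != nil {
		panic(err)
	}
	memLimit, err := parseMemoryBytes("512Mi")
	if err != nil {
		panic(err)
	}
	// Hypothetical usage values; using working set for memory is an assumption.
	cpuUsage := 250.0                  // container_cpu_usage_total, millicores
	memWorkingSet := 384.0 * (1 << 20) // container_memory_working_set, bytes

	// Containers without a limit get no *_over_container_limit metric.
	if cpuLimit > 0 {
		fmt.Printf("container_cpu_utilization_over_container_limit=%.1f%%\n", cpuUsage/cpuLimit*100)
	}
	if memLimit > 0 {
		fmt.Printf("container_memory_utilization_over_container_limit=%.1f%%\n", memWorkingSet/memLimit*100)
	}
}
```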
<br/><br/>

| Resource Attribute |
|-----------------------------------|
| AutoScalingGroupName |
| ClusterName |
| ContainerId |
| ContainerName |
| InstanceId |
| InstanceType |
| K8sPodName |
| Namespace |
| NodeName |
| PodId |
| Timestamp |
| Type |
| Version |
| Sources |
| kubernetes |
| container_status |
| container_status_reason |
| container_last_termination_reason |
| OperatingSystem |