Metrics

We provide various metrics about memory, disk, and important procedures. These metrics can help identify performance issues and monitor a Celeborn cluster.

Prerequisites

  1. Enable Celeborn metrics. Set configuration celeborn.metrics.enabled to true (true by default).

  2. Configure Celeborn metrics properties.

cd $CELEBORN_HOME/conf
cp metrics.properties.template metrics.properties

The default values of the Celeborn metrics configuration are as follows:

*.sink.prometheusServlet.class=org.apache.celeborn.common.metrics.sink.PrometheusServlet
  3. Install Prometheus (https://prometheus.io/). We provide an example Prometheus config file:
# Prometheus example config
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "Celeborn"
    metrics_path: /metrics/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: [ "master-ip:9098","worker1-ip:9096","worker2-ip:9096","worker3-ip:9096","worker4-ip:9096" ]
  4. Install Grafana server (https://grafana.com/grafana/download).

  5. Import the Celeborn dashboard into Grafana.

You can find the Celeborn dashboard templates under the assets/grafana directory. celeborn-dashboard.json displays Celeborn internal metrics and celeborn-jvm-dashboard.json displays Celeborn JVM related metrics.
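
Before wiring up Prometheus and Grafana, it can be useful to confirm by hand that a master or worker actually serves metrics on the path used in the scrape config above. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format. The function names and the metric names in `EXAMPLE` are hypothetical placeholders, not names emitted by Celeborn.

```python
import re
from urllib.request import urlopen

# One sample line looks like: name{label="value"} 12.5
# An optional trailing timestamp is ignored by this sketch.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_samples(text):
    """Return (metric_name, labels_text, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, raw_value = match.groups()
        try:
            value = float(raw_value)
        except ValueError:
            continue  # not a numeric sample, ignore it
        samples.append((name, labels or "", value))
    return samples


def fetch_metrics(host, port, path="/metrics/prometheus", timeout=5):
    """Fetch the raw metrics text from one Celeborn master or worker."""
    with urlopen(f"http://{host}:{port}{path}", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Hypothetical payload, used only to exercise the parser offline.
EXAMPLE = """\
# TYPE example_worker_count gauge
example_worker_count{role="master"} 4
example_heap_used 1.5e9
"""

if __name__ == "__main__":
    for name, labels, value in parse_samples(EXAMPLE):
        print(name, labels, value)
```

Against a live cluster, `parse_samples(fetch_metrics("master-ip", 9098))` would list whatever the master currently exposes, with the host and port replaced by your own.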

Optional

We recommend installing node exporter (https://github.com/prometheus/node_exporter) on every host and configuring Prometheus to scrape host-level metrics from it. Grafana needs a separate dashboard (dashboard ID: 8919) to display host details.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "Celeborn"
    metrics_path: /metrics/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: [ "master-ip:9098","worker1-ip:9096","worker2-ip:9096","worker3-ip:9096","worker4-ip:9096" ]
  - job_name: "node"
    static_configs:
      - targets: [ "master-ip:9100","worker1-ip:9100","worker2-ip:9100","worker3-ip:9100","worker4-ip:9100" ]
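
For larger clusters, writing the target lists by hand becomes error-prone. The sketch below generates both scrape jobs from a list of hostnames, assuming the ports shown in the example above (9098 for the master, 9096 for workers, 9100 for node exporter). The function name `scrape_configs` is hypothetical, and the output is JSON, which most YAML parsers accept; it is worth validating the resulting file with `promtool check config` before use.

```python
import json

# Ports taken from the example config above; adjust if your deployment differs.
MASTER_PORT = 9098
WORKER_PORT = 9096
NODE_EXPORTER_PORT = 9100


def scrape_configs(masters, workers):
    """Build the 'Celeborn' and 'node' scrape jobs for the given hostnames."""
    celeborn_targets = [f"{host}:{MASTER_PORT}" for host in masters]
    celeborn_targets += [f"{host}:{WORKER_PORT}" for host in workers]
    node_targets = [f"{host}:{NODE_EXPORTER_PORT}" for host in masters + workers]
    return [
        {
            "job_name": "Celeborn",
            "metrics_path": "/metrics/prometheus",
            "scrape_interval": "15s",
            "static_configs": [{"targets": celeborn_targets}],
        },
        {
            "job_name": "node",
            "static_configs": [{"targets": node_targets}],
        },
    ]


if __name__ == "__main__":
    config = {
        "global": {"scrape_interval": "15s", "evaluation_interval": "15s"},
        "scrape_configs": scrape_configs(["master-ip"], ["worker1-ip", "worker2-ip"]),
    }
    print(json.dumps(config, indent=2))
```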

Import Dashboard Steps

Here is an example of importing a dashboard into Grafana.

[Screenshots: g1, g2, g3, g4, g6, g5]

Details

MetricName Scope Description
WorkerCount master The count of active workers.
ExcludedWorkerCount master The count of workers in excluded list.
RunningApplicationCount master The count of running applications in the cluster.
OfferSlotsTime master The time taken to offer slots.
PartitionSize master The estimated partition size over the last 20 flush windows, where each window is 15 seconds long by default.
PartitionWritten master The active shuffle size.
PartitionFileCount master The active shuffle partition count.
diskFileCount master and worker The count of disk files consumed by each user.
diskBytesWritten master and worker The amount of disk file data consumed by each user.
hdfsFileCount master and worker The count of HDFS files consumed by each user.
hdfsBytesWritten master and worker The amount of HDFS file data consumed by each user.
RegisteredShuffleCount master and worker The count of registered shuffles.
CommitFilesTime worker CommitFiles means flushing and closing a shuffle partition file.
ReserveSlotsTime worker ReserveSlots means acquiring a disk buffer and recording the partition location.
FlushDataTime worker FlushData means flushing a disk buffer to disk.
OpenStreamTime worker OpenStream means reading a shuffle file and sending the chunks size and stream index to the client.
FetchChunkTime worker FetchChunk means reading a chunk from a shuffle file and sending it to the client.
PrimaryPushDataTime worker PrimaryPushData means handling PushData for the primary partition location.
ReplicaPushDataTime worker ReplicaPushData means handling PushData for the replica partition location.
WriteDataFailCount worker The count of failed PushData or PushMergedData writes in the current worker.
ReplicateDataFailCount worker The count of failed PushData or PushMergedData replications in the current worker.
ReplicateDataWriteFailCount worker The count of PushData or PushMergedData replications that failed because of a write failure in the peer worker.
ReplicateDataCreateConnectionFailCount worker The count of PushData or PushMergedData replications that failed because creating the connection to the peer worker failed.
ReplicateDataConnectionExceptionCount worker The count of PushData or PushMergedData replications that failed because of a connection exception in the peer worker.
ReplicateDataTimeoutCount worker The count of PushData or PushMergedData replications that failed because of a push timeout in the peer worker.
TakeBufferTime worker TakeBuffer means getting a disk buffer from the disk flusher.
SlotsAllocated worker The slots allocated in the last hour.
NettyMemory worker The value measures all kinds of transport memory used by Netty.
SortTime worker SortTime measures the time spent sorting a shuffle file.
SortMemory worker SortMemory means the total memory reserved for sorting shuffle files.
SortingFiles worker The count of shuffle files currently being sorted.
SortedFiles worker The count of sorted shuffle files.
SortedFileSize worker The total size of sorted shuffle files.
DiskBuffer worker Disk buffers are part of the memory used by Netty; they hold data that needs to be written to disk but has not been written yet.
PausePushData worker PausePushData means the number of times the worker stopped receiving data from clients.
PausePushDataAndReplicate worker PausePushDataAndReplicate means the number of times the worker stopped receiving data from clients and other workers.
ActiveShuffleSize worker The active shuffle size of a worker including master replica and slave replica.
ActiveShuffleFileCount worker The active shuffle file count of a worker including master replica and slave replica.
jvm_gc_count JVM The GC count of each garbage collector.
jvm_gc_time JVM The time spent in GC by each garbage collector.
jvm_memory_heap_init JVM The amount of heap init memory.
jvm_memory_heap_max JVM The amount of heap max memory.
jvm_memory_heap_used JVM The amount of heap used memory.
jvm_memory_heap_committed JVM The amount of heap committed memory.
jvm_memory_heap_usage JVM The percentage of heap memory usage.
jvm_memory_non_heap_init JVM The amount of non-heap init memory.
jvm_memory_non_heap_max JVM The amount of non-heap max memory.
jvm_memory_non_heap_used JVM The amount of non-heap used memory.
jvm_memory_non_heap_committed JVM The amount of non-heap committed memory.
jvm_memory_non_heap_usage JVM The percentage of non-heap memory usage.
jvm_memory_pools_init JVM The amount of each memory pool's init memory.
jvm_memory_pools_max JVM The amount of each memory pool's max memory.
jvm_memory_pools_used JVM The amount of each memory pool's used memory.
jvm_memory_pools_committed JVM The amount of each memory pool's committed memory.
jvm_memory_pools_used_after_gc JVM The amount of each memory pool's used memory after GC.
jvm_memory_pools_usage JVM The percentage of each memory pool's memory usage.
jvm_memory_total_init JVM The amount of total init memory.
jvm_memory_total_max JVM The amount of total max memory.
jvm_memory_total_used JVM The amount of total used memory.
jvm_memory_total_committed JVM The amount of total committed memory.
jvm_direct_capacity JVM An estimate of the total capacity of the buffers in the direct buffer pool.
jvm_direct_count JVM An estimate of the number of buffers in the direct buffer pool.
jvm_direct_used JVM An estimate of the memory that the JVM is using for the direct buffer pool.
jvm_mapped_capacity JVM An estimate of the total capacity of the buffers in the mapped buffer pool.
jvm_mapped_count JVM An estimate of the number of buffers in the mapped buffer pool.
jvm_mapped_used JVM An estimate of the memory that the JVM is using for the mapped buffer pool.
jvm_thread_count JVM The current number of threads.
jvm_thread_daemon_count JVM The current number of daemon threads.
jvm_thread_blocked_count JVM The current number of threads in the blocked state.
jvm_thread_deadlock_count JVM The current number of deadlocked threads.
jvm_thread_new_count JVM The current number of threads in the new state.
jvm_thread_runnable_count JVM The current number of threads in the runnable state.
jvm_thread_terminated_count JVM The current number of threads in the terminated state.
jvm_thread_timed_waiting_count JVM The current number of threads in the timed_waiting state.
jvm_thread_waiting_count JVM The current number of threads in the waiting state.
JVMCPUTime system The CPU time consumed by the JVM.
AvailableProcessors system The number of available processors in the system.
LastMinuteSystemLoad system The system load over the last minute.
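
The worker failure counters in the table above are natural candidates for a simple health check. The sketch below compares two snapshots of those counters and reports the ones that grew in between. It assumes each snapshot is a plain dict keyed by the metric names from the table and that the counters only increase over time; the names actually exported to Prometheus may carry extra prefixes or suffixes, and `failure_deltas` is a hypothetical helper.

```python
# Worker failure counters, named as in the table above.
FAILURE_COUNTERS = (
    "WriteDataFailCount",
    "ReplicateDataFailCount",
    "ReplicateDataWriteFailCount",
    "ReplicateDataCreateConnectionFailCount",
    "ReplicateDataConnectionExceptionCount",
    "ReplicateDataTimeoutCount",
)


def failure_deltas(before, after):
    """Return the failure counters that increased between two snapshots.

    Both arguments map metric name -> value; missing counters count as 0.
    """
    deltas = {}
    for name in FAILURE_COUNTERS:
        delta = after.get(name, 0) - before.get(name, 0)
        if delta > 0:
            deltas[name] = delta
    return deltas


if __name__ == "__main__":
    # Made-up values for illustration only.
    earlier = {"WriteDataFailCount": 2, "ReplicateDataTimeoutCount": 1}
    later = {"WriteDataFailCount": 2, "ReplicateDataTimeoutCount": 4}
    print(failure_deltas(earlier, later))
```

In practice an alerting rule in Prometheus or Grafana would do the same comparison continuously; the script form is only meant to make the logic explicit.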

Implementation

Celeborn master metrics: org/apache/celeborn/service/deploy/master/MasterSource.scala.

Celeborn worker metrics: org/apache/celeborn/service/deploy/worker/WorkerSource.scala.

Other common metrics are implemented in the org.apache.celeborn.common.metrics.source package.

Dashboard Snapshots

The Celeborn-dashboard dashboard was generated with Grafana 10.0.3.

Here are some snapshots:

[Snapshots: d1, d2]