Monitoring Metrics

Pinot provides metrics out of the box so that you can monitor every aspect of performance and robustness of the Pinot cluster. Most of the metrics are available either at table level or instance level. There are three main categories of metrics:

  • Gauge – A single value at any point in time
  • Meter – The rate of a metric per unit of time
  • Timer – Records durations, from which you can fetch the average duration per unit of time, percentile values, minimum or maximum values, etc.
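To make the three categories concrete, here is a minimal Python sketch. The `Gauge`, `Meter`, and `Timer` classes below are hypothetical illustrations written for this page, not Pinot's own metric classes, and the percentile uses a simple nearest-rank rule chosen only for clarity.

```python
import math
import statistics


class Gauge:
    """Holds a single value that can be read at any point in time."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Meter:
    """Counts events so that a rate per unit of time can be derived."""

    def __init__(self):
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def rate(self, elapsed_seconds):
        # Events per second over the observed window.
        return self.count / elapsed_seconds


class Timer:
    """Records durations and reports summary statistics over them."""

    def __init__(self):
        self.durations_ms = []

    def record(self, duration_ms):
        self.durations_ms.append(duration_ms)

    def mean(self):
        return statistics.mean(self.durations_ms)

    def maximum(self):
        return max(self.durations_ms)

    def percentile(self, p):
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
        ordered = sorted(self.durations_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]
```

In this sketch a segment count would be a gauge, a query rate would be a meter, and a query phase duration would be a timer.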

Pinot Server

| Metric Name | Description | Metric Type |
| --- | --- | --- |
| LLC-PARTITION-CONSUMING | A binary value indicating whether low-level consumption is healthy (1) or unhealthy (0). It's important to ensure that at least one replica of each partition is consuming. | |
| HIGHEST-STREAM-OFFSET-CONSUMED | The highest offset that has been consumed so far | |
| DOCUMENT_COUNT | Total number of records in the table | |
| SEGMENT_COUNT | Total number of segments in the table | |
| UPSERT_PRIMARY_KEYS_COUNT | Total number of unique primary keys in the table | |
| LAST_REALTIME_SEGMENT_CREATION_DURATION_SECONDS | Time in seconds it took to create the latest real-time segment | |
| LAST_REALTIME_SEGMENT_CREATION_WAIT_TIME_SECONDS | Time in seconds it took for segment creation to start (generally spent waiting to acquire a lock) | |
| LAST_REALTIME_SEGMENT_INITIAL_CONSUMPTION_DURATION_SECONDS | Time in seconds spent consuming records for the latest segment | |
| LAST_REALTIME_SEGMENT_CATCHUP_DURATION_SECONDS | Time in seconds spent catching up to the latest offset in metadata. This can happen when multiple servers are consuming from the same partition. | |
| LAST_REALTIME_SEGMENT_COMPLETION_DURATION_SECONDS | Time in seconds between when consumption of records stopped and when the segment was committed | |
| REALTIME_OFFHEAP_MEMORY_USED | Off-heap memory in bytes currently used by real-time segments | |
| REALTIME_SEGMENT_NUM_PARTITIONS | Number of partitions for a table | |
| LLC_SIMULTANEOUS_SEGMENT_BUILDS | Number of segments currently being built | |
| REALTIME_INGESTION_DELAY_MS | Per-partition metric that measures the delay in milliseconds from the time an event was produced to the stream that feeds Pinot until the event was consumed by Pinot. Partitions that are not actively consuming due to a lack of events report a delay of 0. Partitions that are stuck or falling behind report their last measured delay aged by the time since the sample was taken; this lets you monitor partitions that have events queued but where Pinot is falling behind in consumption. This metric assumes event timestamps are in the UTC time zone. If timestamps use another time zone, the reported delay will be offset accordingly. | |
| ROWS_WITH_ERRORS | Number of rows that either didn't get transformed or didn't get indexed | |
| REALTIME_ROWS_CONSUMED | Total number of records consumed from input | |
| INVALID_REALTIME_ROWS_DROPPED | Number of records that were filtered out based on the FilterConfig specified in the table config | |
| REALTIME_CONSUMPTION_EXCEPTIONS | Number of rows that were not consumed because of an exception. It doesn't track exceptions during transformation and indexing. | |
| RELOAD_FAILURES | Number of failures that occurred while reloading segments | |
| REFRESH_FAILURES | Number of failures that occurred while refreshing segments | |
| UNTAR_FAILURES | Number of failures that occurred while uncompressing segments | |
| SEGMENT_DOWNLOAD_FAILURES | Number of failures that occurred while downloading segments from deep store to local storage | |
| DELETED_SEGMENT_COUNT | Number of segments deleted because of retention policies, explicit delete requests, etc. | |
| QUERIES | Number of queries executed | |
| QUERY_EXECUTION_EXCEPTIONS | Number of exceptions encountered during query execution | |
| NUM-MISSING-SEGMENTS | Number of segments that the broker queried for (expecting them to be on the server) but the server didn't have. This can be due to retention or a stale routing table. | |
| NO_TABLE_ACCESS | Number of query requests for which table access was denied, either because the table was not present or because of access control restrictions | |
| HELIX_ZOOKEEPER_RECONNECTS | Number of times the server instance reconnected to ZooKeeper | |
| NETTY_CONNECTION_BYTES_RECEIVED | Total bytes received by the server | |
| NETTY_CONNECTION_BYTES_SENT | Total bytes sent by the server | |
| NETTY_CONNECTION_RESPONSES_SENT | Total responses sent by the server | |
| FRESHNESS_LAG_MS | Time period between when the data was last updated in the table and the current time | |
| NETTY_CONNECTION_SEND_RESPONSE_LATENCY | Time spent sending the response to brokers after the results are available | |
| EXECUTION_THREAD_CPU_TIME_NS | Time spent by all threads processing the query and results (doesn't include time spent in system activities) | |
| SYSTEM_ACTIVITIES_CPU_TIME_NS | Time spent in nanoseconds processing the query on the servers (only counts system activities such as GC, OS paging, etc.) | |
| RESPONSE_SER_CPU_TIME_NS | Time spent in nanoseconds serializing the query response on servers | |
| TOTAL_CPU_TIME_NS | Total time spent in nanoseconds processing the query on the servers | |
| END_TO_END_REALTIME_INGESTION_DELAY_MS | When supported by the underlying stream, this metric provides the ingestion delay in milliseconds from the time an event was ingested by the first stream in your ingestion pipeline to the time the event was ingested by Pinot. The metric is not emitted when the underlying stream does not support this feature. The metric relies on the timestamp being in the UTC time zone. If your timestamp is in another time zone, the metric will be offset accordingly. | |
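The aging behaviour described for REALTIME_INGESTION_DELAY_MS can be sketched as follows. This is an illustration of the description above under simple assumptions, not Pinot's implementation; the function and parameter names are hypothetical, and all timestamps are assumed to be epoch milliseconds in UTC.

```python
def measure_delay_ms(event_time_ms, consumed_time_ms):
    """Delay between an event being produced to the stream and being consumed."""
    # Clamp at zero so clock skew cannot produce a negative delay.
    return max(0, consumed_time_ms - event_time_ms)


def reported_delay_ms(last_delay_ms, sample_time_ms, now_ms, caught_up):
    """Delay to report for one partition at time now_ms.

    caught_up is True when the partition has no pending events,
    in which case the description above says 0 is reported.
    """
    if caught_up:
        return 0
    # Stuck or lagging partition: age the last sample by the time elapsed since it was taken.
    return last_delay_ms + (now_ms - sample_time_ms)
```

For example, a partition whose last sample measured 500 ms of delay and has made no progress for 3 seconds would report 3500 ms in this sketch, so the value keeps growing while the partition stays stuck.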

The following metrics track the time spent, in milliseconds, in each phase of query execution:

| Metric Name | Description |
| --- | --- |
| REQUEST_DESERIALIZATION | Time spent deserializing the query request |
| SEGMENT_PRUNING | Time spent in segment pruning |
| BUILD_QUERY_PLAN | Time spent building the query plan |
| QUERY_PLAN_EXECUTION | Time spent executing the query plan |
| QUERY_PROCESSING | Total time spent processing the query request, from receiving the parsed query to getting data. Doesn't include ser-de time. |
| SCHEDULER_WAIT | Time spent in the scheduler queue waiting for the query to be executed |
| RESPONSE_SERIALIZATION | Time spent serializing the query response |
| TOTAL_QUERY_TIME | Total time taken from receiving the query to returning the response |

Pinot Broker

| Metric Name | Description | Metric Type |
| --- | --- | --- |
| UNHEALTHY_SERVERS | Number of unhealthy servers detected | |
| QUERY_QUOTA_CAPACITY_UTILIZATION_RATE | Percentage of the configured rate limit being used on each broker | |
| MAX_BURST_QPS | | |
| QUERY_RATE_LIMIT_DISABLED | 1 if rate limiting is disabled on the broker, 0 otherwise | |
| REQUEST_SIZE | Query string length on each broker | |
| RESIZE_TIME_MS | Time spent resizing results for the output, whether because of LIMIT, the maximum allowed group-by keys, or any other criteria | |
| QUERIES | The rate at which an individual broker is receiving queries, in QPS | |
| REQUEST_COMPILATION_EXCEPTIONS | Number of queries that failed during compilation | |
| RESOURCE_MISSING_EXCEPTIONS | Number of queries for which the table doesn't exist | |
| QUERY_VALIDATION_EXCEPTIONS | Number of invalid queries | |
| UNKNOWN_COLUMN_EXCEPTIONS | Number of queries with unknown columns | |
| NO_SERVER_FOUND_EXCEPTIONS | Number of queries for which no server was found to contain the data | |
| REQUEST_TIMEOUT_BEFORE_SCATTERED_EXCEPTIONS | Number of times a query timed out before even being sent to the servers | |
| REQUEST_CHANNEL_LOCK_TIMEOUT_EXCEPTIONS | Number of times a query failed while trying to acquire a lock on server connections | |
| REQUEST_SEND_EXCEPTIONS | Number of queries that failed while being sent to a server | |
| RESPONSE_FETCH_EXCEPTIONS | Number of queries that failed while handling responses from servers | |
| DATA_TABLE_DESERIALIZATION_EXCEPTIONS | Number of queries that failed while deserializing response data from servers | |
| RESPONSE_MERGE_EXCEPTIONS | Number of queries that failed while merging responses from multiple servers. This can be due to schema inconsistency or other issues. | |
| BROKER_RESPONSES_WITH_PROCESSING_EXCEPTIONS | Number of queries where at least one exception occurred | |
| BROKER_RESPONSES_WITH_PARTIAL_SERVERS_RESPONDED | Number of queries with incomplete results due to missing responses from servers | |
| BROKER_RESPONSES_WITH_NUM_GROUPS_LIMIT_REACHED | Number of queries where the total number of groups exceeded the configured limit (default limit: 100K) | |
| DOCUMENTS_SCANNED | Total number of documents read from segments in each query | |
| ENTRIES_SCANNED_IN_FILTER | | |
| ENTRIES_SCANNED_POST_FILTER | | |
| NUM_RESIZES | Number of result resizes for queries | |
| REQUEST_DROPPED_DUE_TO_ACCESS_ERROR | Number of queries dropped due to invalid access permissions on the table | |
| GROUP_BY_SIZE | Number of rows in group-by queries | |
| TOTAL_SERVER_RESPONSE_SIZE | Total number of bytes received from servers for queries | |
| QUERY_QUOTA_EXCEEDED | Number of queries that failed because the query rate limit was breached | |
| NO_SERVING_HOST_FOR_SEGMENT | Number of segments per query for which no servers are available | |
| SERVER_MISSING_FOR_ROUTING | Number of servers that could not be added to the routing table for a query | |
| NETTY_CONNECTION_REQUESTS_SENT | Total number of requests sent to servers | |
| NETTY_CONNECTION_BYTES_SENT | Total bytes sent to servers | |
| NETTY_CONNECTION_BYTES_RECEIVED | Total bytes received from servers | |
| PROACTIVE_CLUSTER_CHANGE_CHECK | Number of requests raised to ZooKeeper to check the cluster state, such as IDEAL STATES, EXTERNAL VIEW, etc. | |
| HELIX_ZOOKEEPER_RECONNECTS | Number of times the broker instance reconnected to ZooKeeper | |
| CLUSTER_CHANGE_QUEUE_TIME | Time spent in milliseconds in the queue for cluster change requests | |
| FRESHNESS_LAG_MS | Time period between when the data was last updated in the table and the current time | |
| NETTY_CONNECTION_SEND_REQUEST_LATENCY | Latency of sending the request from broker to server | |
| OFFLINE_THREAD_CPU_TIME_NS | Aggregated thread CPU time in nanoseconds for query processing from offline servers | |
| REALTIME_THREAD_CPU_TIME_NS | Aggregated thread CPU time in nanoseconds for query processing from real-time servers | |
| OFFLINE_SYSTEM_ACTIVITIES_CPU_TIME_NS | Aggregated system activities CPU time in nanoseconds for query processing from offline servers (e.g. GC, OS paging, etc.) | |
| REALTIME_SYSTEM_ACTIVITIES_CPU_TIME_NS | Aggregated system activities CPU time in nanoseconds for query processing from real-time servers (e.g. GC, OS paging, etc.) | |
| OFFLINE_RESPONSE_SER_CPU_TIME_NS | Aggregated response serialization CPU time in nanoseconds for query processing from offline servers | |
| REALTIME_RESPONSE_SER_CPU_TIME_NS | Aggregated response serialization CPU time in nanoseconds for query processing from real-time servers | |
| OFFLINE_TOTAL_CPU_TIME_NS | Aggregated total CPU time (thread + system activities + response serialization) in nanoseconds for query processing from offline servers | |
| REALTIME_TOTAL_CPU_TIME_NS | Aggregated total CPU time (thread + system activities + response serialization) in nanoseconds for query processing from real-time servers | |
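The total CPU time metrics above are described as the sum of three components. The short sketch below shows that arithmetic with made-up numbers; the `CpuTimeBreakdown` class and its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CpuTimeBreakdown:
    """CPU time components for one server type, all in nanoseconds."""

    thread_ns: int
    system_activities_ns: int
    response_ser_ns: int

    def total_ns(self) -> int:
        # total = thread + system activities + response serialization
        return self.thread_ns + self.system_activities_ns + self.response_ser_ns

    def total_ms(self) -> float:
        # Convert nanoseconds to milliseconds for easier reading on dashboards.
        return self.total_ns() / 1_000_000


# Illustrative values only, not measurements from a real cluster.
offline = CpuTimeBreakdown(
    thread_ns=4_000_000,
    system_activities_ns=500_000,
    response_ser_ns=250_000,
)
print(offline.total_ns())  # 4750000
print(offline.total_ms())  # 4.75
```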

The following metrics track the time spent, in milliseconds, in each phase of query execution:

| Metric Name | Description | Metric Type |
| --- | --- | --- |
| REQUEST_COMPILATION | Time spent compiling the SQL query | |
| QUERY_EXECUTION | Total time spent in query execution | |
| QUERY_ROUTING | Time spent creating a routing table for segments | |
| SCATTER_GATHER | Time spent sending requests to and collecting responses from servers | |
| REDUCE | Time spent combining query results from multiple servers | |
| AUTHORIZATION | Time spent checking table access after query compilation | |

Pinot Controller

| Metric Name | Description | Metric Type |
| --- | --- | --- |
| PERCENT_SEGMENTS_AVAILABLE | Percentage of complete online replicas in the external view compared to replicas in the ideal state | |
| NUMBER_OF_REPLICAS | Total number of replicas available for the table | |
| SEGMENTS_IN_ERROR_STATE | Number of segments in an ERROR state for a given table | |
| TABLE_STORAGE_QUOTA_UTILIZATION | How much of the table's storage quota is currently being used, reported as a percentage of the entire quota | |
| LAST_PUSH_TIME_DELAY_HOURS | Time in hours since an offline segment was last pushed to the controller | |
| HEALTHCHECK_OK_CALLS | Number of health check requests for which the controller was healthy | |
| HEALTHCHECK_BAD_CALLS | Number of health check requests for which the controller was unhealthy | |
| CONTROLLER_INSTANCE_POST_ERROR | Errors that occurred while updating state for an instance (server or broker) | |
| CONTROLLER_SEGMENT_UPLOAD_ERROR | Errors that occurred while sending a segment upload request | |
| CONTROLLER_SCHEMA_UPLOAD_ERROR | Errors that occurred while uploading a schema | |
| CONTROLLER_TABLE_SCHEMA_UPDATE_ERROR | Errors that occurred while updating a schema | |
| CONTROLLER_TABLE_ADD_ERROR | Errors that occurred while adding a table config | |
| CONTROLLER_TABLE_UPDATE_ERROR | Errors that occurred while updating a table config | |
| CONTROLLER_TABLE_TENANT_UPDATE_ERROR | Errors that occurred while updating a tenant | |
| CONTROLLER_TABLE_TENANT_CREATE_ERROR | Errors that occurred while creating a tenant | |
| CONTROLLER_TABLE_TENANT_DELETE_ERROR | Errors that occurred while deleting a tenant | |
| CONTROLLER_REALTIME_TABLE_SEGMENT_ASSIGNMENT_ERROR | Errors that occurred while assigning a real-time segment to a server instance | |
| CONTROLLER_LEADERSHIP_CHANGE_WITHOUT_CALLBACK | Number of times a controller lost or gained leadership without a callback from Helix | |
| CONTROLLER_PERIODIC_TASK_RUN | Number of periodic tasks currently running | |
| CONTROLLER_PERIODIC_TASK_ERROR | Number of periodic tasks that failed due to an error | |
| NUMBER_TIMES_SCHEDULE_TASKS_CALLED | Number of minion task schedule requests sent to the controller | |
| NUMBER_TASKS_SUBMITTED | Number of minion tasks submitted to the controller | |
| NUMBER_SEGMENT_UPLOAD_TIMEOUT_EXCEEDED | Number of segment uploads that failed due to a timeout. In this case the controller re-uploads the segments itself. | |
| CRON_SCHEDULER_JOB_TRIGGERED | Number of cron-based minion tasks triggered | |
| NUMBER_ADHOC_TASKS_SUBMITTED | Number of ad hoc minion tasks submitted | |
| LLC_STATE_MACHINE_ABORTS | Number of times a real-time segment commit operation was aborted | |
| LLC_ZOOKEEPER_FETCH_FAILURES | Number of ZooKeeper metadata fetch requests that failed | |
| LLC_ZOOKEEPER_UPDATE_FAILURES | Number of ZooKeeper metadata update requests that failed | |
| LLC_STREAM_DATA_LOSS | Indicates data loss for a table, either because the offsets available to consume from the topic are larger than the last offset stored in Pinot or because a segment was lost in the CONSUMING state | |
| HELIX_ZOOKEEPER_RECONNECTS | Number of times the controller instance reconnected to ZooKeeper | |
| CRON_SCHEDULER_JOB_EXECUTION_TIME_MS | Time spent scheduling cron jobs | |
| IDEAL_STATE_UPDATE_FAILURE | Indicates a failure to update the ideal state of a table | |
| IDEAL_STATE_UPDATE_RETRY | Number of retries while updating the ideal state of a table | |
| IDEAL_STATE_UPDATE_TIME_MS | Time spent updating the ideal state of a table | |
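As a rough illustration of comparing the external view against the ideal state, the sketch below computes the share of replicas expected to be ONLINE that actually are. This is one plausible reading of the PERCENT_SEGMENTS_AVAILABLE description, not Pinot's actual computation; the function name and the nested-dictionary layout (segment name to instance name to state) are assumptions made for the example.

```python
def percent_replicas_online(ideal_state, external_view):
    """Percentage of replicas marked ONLINE in the ideal state that are also ONLINE in the external view.

    Both arguments map segment name -> {instance name -> state}.
    """
    expected = 0
    online = 0
    for segment, replicas in ideal_state.items():
        for instance, state in replicas.items():
            if state != "ONLINE":
                continue  # only replicas that are supposed to be serving are counted
            expected += 1
            if external_view.get(segment, {}).get(instance) == "ONLINE":
                online += 1
    if expected == 0:
        return 100.0  # nothing is expected to be online, so nothing is missing
    return 100.0 * online / expected


# Hypothetical table with two segments and two replicas each; one replica is in ERROR.
ideal = {
    "seg_0": {"server_1": "ONLINE", "server_2": "ONLINE"},
    "seg_1": {"server_1": "ONLINE", "server_2": "ONLINE"},
}
view = {
    "seg_0": {"server_1": "ONLINE", "server_2": "ONLINE"},
    "seg_1": {"server_1": "ONLINE", "server_2": "ERROR"},
}
print(percent_replicas_online(ideal, view))  # 75.0
```

A value below 100 in a check like this is the kind of signal worth alerting on alongside SEGMENTS_IN_ERROR_STATE.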

Pinot Minion

| Metric Name | Description | Metric Type |
| --- | --- | --- |
| NUMBER_OF_TASKS | Number of tasks currently running | |
| NUMBER_TASKS_EXECUTED | Number of tasks triggered in Minion | |
| NUMBER_TASKS_COMPLETED | Number of tasks completed successfully | |
| NUMBER_TASKS_CANCELLED | Number of tasks that were cancelled | |
| NUMBER_TASKS_FAILED | Number of tasks that failed | |
| NUMBER_TASKS_FATAL_FAILED | Number of tasks that failed with unretryable exceptions | |
| TASK_QUEUEING | Time spent by tasks in queue | |
| TASK_EXECUTION | Time spent by tasks in execution | |