diff --git a/doc/event-alarm-framework/event-alarm-framework-seqdiag.png b/doc/event-alarm-framework/event-alarm-framework-seqdiag.png deleted file mode 100644 index 5473b2c63a..0000000000 Binary files a/doc/event-alarm-framework/event-alarm-framework-seqdiag.png and /dev/null differ diff --git a/doc/event-alarm-framework/event-alarm-framework.md b/doc/event-alarm-framework/event-alarm-framework.md index eb475cb6fd..00764de663 100644 --- a/doc/event-alarm-framework/event-alarm-framework.md +++ b/doc/event-alarm-framework/event-alarm-framework.md @@ -6,7 +6,6 @@ Event and Alarm Framework # Table of Contents * [Revision](#revision) * [About This Manual](#about-this-manual) - * [Scope](#scope) * [1 Feature Overview](#1-feature-overview) * [1.1 Requirements](#11-requirements) * [1.1.1 Functional Requirements](#111-functional-requirements) @@ -25,12 +24,8 @@ Event and Alarm Framework * [3.1.2.2 Sequence-ID](#3122-sequence-id) * [3.1.3 Alarm Consumer](#313-alarm-consumer) * [3.1.4 Event Receivers](#314-event-receivers) - * [3.1.4.1 syslog](#3141-syslog) - * [3.1.4.2 REST](#3142-rest) - * [3.1.4.3 gNMI](#3143-gnmi) - * [3.1.4.4 System LED](#3144-system-led) - * [3.1.4.5 Event/Alarm flooding](#3145-event/alarm-flooding) - * [3.1.4.6 Eventd continuous restart](#3146-event-continuous-restart) + * [3.1.4.1 System LED](#3144-system-led) + * [3.1.4.2 Event/Alarm flooding](#3145-event/alarm-flooding) * [3.1.5 Event Profile](#315-event-profile) * [3.1.6 CLI](#316-cli) * [3.1.7 Event Table and Alarm Table](#317-event-table-and-alarm-table) @@ -45,13 +40,16 @@ Event and Alarm Framework * [3.3.2.2 Configuration Commands](#3322-configuration-commands) * [3.3.2.3 Show Commands](#3323-show-commands) * [3.3.3 REST API Support](#333-rest-api-support) - * [4 Flow Diagrams](#4-flow-diagrams) - * [5 Warm Boot Support](#5-warm-boot-support) - * [5.1 Application warm boot](#51-application-warm-boot) - * [5.2 eventd warm boot](#52-eventd-warm-boot) - * [6 Scalability](#6-scalability) - * [7 
Showtech Support](#7-showtech-support) - * [8 Unit Test](#8-unit-test) + * [4 Persistence](#4-persistence) + * [4.1 Warm reboot](#41-warm-reboot) + * [4.1.1 Application restart](#411-application-restart) + * [4.1.2 System warm reboot](#412-system-warm-reboot) + * [4.2 Fast reboot](#42-fast-reboot) + * [4.3 Cold reboot](#43-cold-reboot) + * [4.4 Power reset](#44-power-reset) + * [5 Scalability](#6-scalability) + * [6 Showtech Support](#7-showtech-support) + * [7 Unit Test](#8-unit-test) # Revision @@ -62,25 +60,24 @@ Event and Alarm Framework | 0.3 | 04/18/2022 | Bhavesh | Address review comments | # About this Manual -This document provides general information on the implementation and functionality of Event and Alarm Framework in SONiC. +This document provides a high-level design of event and alarm management. It extends the existing event producer framework. +Refer to https://github.com/sonic-net/SONiC/blob/master/doc/event-alarm-framework/events-producer.md for the current event framework. +Note: Wherever CLI is specified, it is the KLISH CLI that is referred to - SONiC native (CLICK) CLI is not updated for this feature. -Note: Wherever CLI is specified, it is the CLISH cli that is referred - SONiC native (CLICK) CLI is not updated for this feature. - -# Scope -This document describes the high-level design of Event and Alarm Framework. -It is not in the scope of the framework to update ANY of the applications to raise events and alarms. # 1 Feature Overview -The Event and Alarm Framework feature provides a centralized framework for applications in SONiC to raise notifications and store them for north bound interfaces to listen and fetch to monitor the device. +The Event and Alarm Framework feature provides a centralized framework for applications in SONiC to raise notifications and store them for north bound interfaces for monitoring the device. -Events and Alarms are notifications to indicate a change in the state of the system that operator may be interested in. 
+Events and alarms are notifications to indicate a change in the state of the system that operator may be interested in. Such a change has an important metric called *severity* to indicate how critical it is to the health of the system. * Events Events are "one shot" notifications to indicate an abnormal/important situation. + Events are "one shot" notifications to indicate an abnormal/important situation. + User logging in, authentication failure, configuration changed notification are all examples of events. * Alarms @@ -102,7 +99,7 @@ Such a change has an important metric called *severity* to indicate how critical The set of alarms and their severities are an indication to health of various applications of the system and System LED can be deduced from alarms. An acknowledged alarm means that operator is aware of the condition so, acknowledged alarm will be taken out of consideration. -Both events and alarms get recorded in a new DB called EVENT DB in a new redis instance. +Both events and alarms get recorded in redis DB. 1. Event Table @@ -150,7 +147,7 @@ As mentioned above, each event has an important characteristic: severity. SONiC The following describes how an alarm transforms and how various tables are updated. ![Alarm Life Cycle](event-alarm-framework-alarm-lifecycle.png) -By default every event will have a severity assigned by the component. The framework provides Event Profiles to customize severity of an event and also disable an event. +By default every event will have a severity assigned by the component. Template for event profile is as below: ``` @@ -166,23 +163,15 @@ Template for event profile is as below: ] } ``` -Event Profiles only contains declarations of events and their characteristics. There has to be an application to raise these events using eventnotify API. +Event profile only contains declarations of events and their characteristics. The framework maintains default event profile at /etc/evprofile/default.json. 
-Operator can download default event profile to a remote host. -This downloaded file can be modified by changing the severity or enable flag of event(s). -This modified file can then be uploaded to the device to /etc/evprofile/. -Operator can select any of these custom event profiles to change default properties of events. -The selected profile is persistent across reboots and will be in effect until operator selects either default or another custom profile. -In addition to storing events in DB, framework forwards log messages corresponding to all the events to syslog. -Syslog message displays the type (ALARM or EVENT), action (RAISE, CLEAR, ACKNOWLEDGE or UNACKNOWLEDGE) - when the message corresponds to an event of an alarm, name of the event and detailed message. - -gNMI clients can subscribe to receive events as they are raised. Subscribing through REST is being evaluated. +gNMI clients can subscribe to receive events as they are raised. CLI and REST/gNMI clients can query either table with filters - based on severity, delta based on timestamp, sequence-id etc., -Application owners need to identify various conditions that would be of interest to the operator and use the framework to raise events/alarms. +Application owners need to identify various conditions that would be of interest to the operator and use the framework to publish events/alarms. ## 1.1 Requirements @@ -191,65 +180,58 @@ Application owners need to identify various conditions that would be of interest | ID | Requirement | Comment | | :--- | :---- | :--- | -| 1 | Provide API via library for apps to publish events | | -| 2 | Provide API via library for apps to publish alarms | | -| 3 | Event Infra to write formatted syslog messages corresponding to all events to Syslog. | | -| 4 | Event Infra to persist all events and alarms in DB. | | -| 5 | Event Infra to read Event profile ( severity and enable/disable flag ) from a json file. 
| | -| 6 | Event Infra to read Event table parameters (size and # of days) from a config file. | | -| 7 | NBI interface (gNMI and REST) and CLI | | -| 7.1 | Events | | -| 7.1.1 | Openconfig interface to pull event information. | | -| 7.1.2 | Openconfig interface to pull event summary information. | | +| 1 | Event Infra to persist all events and alarms in DB. | | +| 2 | Event Infra to read Event profile ( severity and enable/disable flag ) from a json file. | | +| 3 | Event Infra to read Event table parameters (size and # of days) from a config file. | | +| 4 | NBI interface (gNMI and REST) and CLI | | +| 5.1 | Events | | +| 5.1.1 | Openconfig interface to pull event information. | | +| 5.1.2 | Openconfig interface to pull event summary information. | | | | Event summary information to contain cumulative counters for: | | | | - Raised-count (events) | | -| 7.1.3 | Openconfig interface to pull events using following filters | | +| 5.1.3 | Openconfig interface to pull events using following filters | | | | - ALL ( pull all events) | | | | - Severity. | | | | - Recent records (eg., last 5 minutes, one hour, one day). | | | | - Records between two timestamps, one timestamp and end, and beginning and a timestamp. | | | | - All records between two Sequence Numbers (incl begin and end) | | -| 7.2 | Alarms | | -| 7.2.1 | Openconfig interface to pull alarm information. | | -| 7.2.2 | Openconfig interface to pull alarm summary information. | | +| 5.2 | Alarms | | +| 5.2.1 | Openconfig interface to pull alarm information. | | +| 5.2.2 | Openconfig interface to pull alarm summary information. | | | | Counters for Total, Critical, Major, Minor, Warning, Acknowledged | | -| 7.2.3 | Openconfig interface to pull alarms using following filters | | +| 5.2.3 | Openconfig interface to pull alarms using following filters | | | | - All (pull all events) | | | | - Severity. | | | | - Recent alarms (eg., last 5 minutes, one hour, one day). 
| | | | - Records between two timestamps, one timestamp and end, and beginning and a timestamp. | | | | - All records between two Sequence Numbers (incl end and begin) | | -| 7.2.4 | Openconfig interface to acknowledge an alarm. | | -| 8 | CLI commands | | -| 8.1 | show alarm [ detail \| summary \| severity \| timestamp \| recent <5min\|1hr\|1day> \| sequence-number \| all] | | -| 8.2 | show event [ detail \| summary \| severity \| timestamp \| recent <5min\|1hr\|1day> \| sequence-number ] | | -| 8.3 | show event profile | | -| 8.4 | alarm acknowledge | | -| 8.5 | logging server [ log \| event ] | default is 'log' | -| 8.6 | event profile [ default \| name-of-file ] | | -| 9 | gNMI subscription | | -| 9.1 | Subscribe to openconfig Event container and Alarm container. All events and alarms published to gNMI subscribed clients. | | -| 10 | Clear all events | | -| 11 | Any change in open source should be aligned and upstream. | | +| 5.2.4 | Openconfig interface to acknowledge an alarm. | | +| 6 | CLI commands | | +| 6.1 | show alarm [ detail \| summary \| severity \| timestamp \| recent <5min\|1hr\|1day> \| sequence-number \| all] | | +| 6.2 | show event [ detail \| summary \| severity \| timestamp \| recent <5min\|1hr\|1day> \| sequence-number ] | | +| 6.3 | alarm acknowledge | | +| 6.4 | logging server [ log \| event ] | default is 'log' | +| 7 | gNMI subscription | | +| 7.1 | Subscribe to openconfig Event container and Alarm container. All events and alarms published to gNMI subscribed clients. | | +| 8 | Clear all events | | +| 9 | Any change in open source should be aligned and upstream. | | ## 1.2 Design Overview -![Block Diagram](event-alarm-framework-blockdiag.png) +![Block Diagram](eventd-block.png) ### 1.2.1 Basic Approach The feature involves new development. -Applications act as producers by writing to a table with the help of event notify library. -Eventd reads new record in the table and processes it: +EventDB service subscribes to the zmqproxy service. 
It processes events as: It saves the entry in event table; if the event has an action and if it is *RAISE*, record gets added to alarm table, severity counter in ALARM_STATS is increased. If the received event action is *CLEAR*, record in the ALARM table is removed and severity counter in ALARM_STATS of that alarm is reduced by 1. If eventd receives an event with action *ACKNOWLEDGE* from mgmt-framework, severity counter in ALARM_STATS is reduced by 1. If eventd receives an event with action *UNACKNOWLEDGE* from mgmt-framework, severity counter in ALARM_STATS is increased by 1. -Eventd then informs logging API to format the log message and send the message to syslog. Any application like pmon can subscribe to tables like ALARM_STATS to act accordingly. ### 1.2.2 Container -A new container by name, eventd, is created to hold event consumer logic. +The new EventDB service will execute in the existing eventd container. # 2 Functionality ## 2.1 Target Deployment Use Cases @@ -258,7 +240,7 @@ The framework assigns an unique sequence number to each of the events sent by ap In addition, the framework provides the following key management services: -- Push model: Event/Alarm information to remote syslog hosts and subscribed gNMI clients +- Push model: Event/Alarm information to subscribed gNMI clients - Pull model: Event/Alarm information from CLI, REST/gNMI interfaces - Ability to change severity of events, turn off a particular event - Ability to acknowledge an alarm @@ -268,49 +250,36 @@ Event Management Framework allows applications to store "state" of the system fo # 3 Design ## 3.1 Overview -There are three players in the event framework. Producers, which raises events; a consumer to receive and process them as they are raised and a set of receivers one for each NBI type. - -Applications act as producers of events. -Event consumer class in eventd container receives and processes the received event. 
-Event consumer manages received events, updates event table, alarm table, event_stats table and alarm_stats tables and invokes logging API, which constructs message and sends it over to syslog. - -Operator can chose to change properties of events with the help of event profile. Default -event profile is available at */etc/evprofile/default.json*. User can download the default event profile, -modify and upload it back to the switch to apply it. - -Through event profile, user can change severity of any event and also can enable/disable a event. +The EventDB service in the eventd container receives and processes events from the zmqproxy service. +Event consumer manages received events, updates event table, alarm table, event_stats table and alarm_stats tables. Through CLI, REST or gNMI, event table and alarm table can be retrieved using various filters. ### 3.1.1 Event Producers -Application that need to raise an event, need to use event notifiy API ( LOG_EVENT ). -This API is part of *libeventnotify* library that applications need to link. -For one-shot events, applications need to provide event-id (name of the event), source, dynamic message, and event action set to NOTIFY. +For one-shot events, applications need to provide event-id (name), source, and dynamic message. -For alarms, applications need to provide event-id (name of the event), source, dynamic message, and event action (RAISE_ALARM / CLEAR_ALARM / ACK_ALARM /UNACK_ALARM). +For alarms, applications need to provide event-id (name), source, dynamic message, and event action (RAISE_ALARM / CLEAR_ALARM / ACK_ALARM / UNACK_ALARM). The ACK_ALARM/UNACK_ALARM action types are used only by mgmt-framework to provide the functionality to acknowledge/unacknowledge the alarms through NBI. -Eventd maintains a json file of events and alarms at sonic-eventd/etc/evprofile/default.json. This is the default event profile that gets installed on the device at /etc/evprofile/default.json. 
+Eventd maintains a json file of events and alarms at sonic-eventd/var/evprofile/default.json. This is the default event profile that gets installed on the device at /etc/evprofile/default.json. Developers of new events or alarms need to update this file by declaring name and other characteristics - severity, enable flag and static message that gets appended with dynamic message. ``` { "__README__" : "This is default map of events that eventd uses. Developer can modify this file and send - SIGINT to eventd to make it read and use the updated file. Alternatively developer can test - the new event by adding it to a custom event profile and use 'event profile ' command - to apply that profile without sending SIGINT to eventd. Developer need to commit default.json file - with the new event after testing it out. + SIGINT to eventd to make it read and use the updated file. + Developer need to commit default.json file with the new event after testing it out. Supported severities are: CRITICAL, MAJOR, MINOR, WARNING and INFORMATIONAL. Supported enable flag values are: true and false.", "events":[ { - "name" : "CUSTOM_EVPROFILE_CHANGE", - "revision" : 0, - "severity" : "INFORMATIONAL", - "enable" : "true", - "message" : "Custom Event Profile is applied." + "name" : "SYSTEM_STATUS", + "revision" : 0, + "severity" : "INFORMATIONAL", + "enable" : "true", + "message" : "System Status Information" }, { "name": "TEMPERATURE_EXCEEDED", @@ -322,76 +291,48 @@ Developers of new events or alarms need to update this file by declaring name an ] } ``` -The format of event notify API is: - -definition: -``` - LOG_EVENT(name, source, action, MSG, ...) -``` -- name is name of the event -- source is the object that is generating this event -- action is either NOTIFY, RAISE_ALARM, CLEAR_ALARM, ACK_ALARM or UNACK_ALARM -- MSG can be json string. If json string, it is rendered as is in the syslog. 
- -Usage: -For one-shot events: -``` - LOG_EVENT(CUSTOM_EVPROFILE_CHANGE, profile_name.c_str(), NOTIFY, "New event profile is %s", profile_name.c_str()); -``` - -For alarms: -``` - if (temperature >= THRESHOLD) { - LOG_EVENT(TEMPERATURE_EXCEEDED, sensor_name_p, RAISE_ALARM, "Temperature for sensor %s is %d degrees", sensor_name_p, current_temp); - } else { - LOG_EVENT(TEMPERATURE_EXCEEDED, sensor_name_p, CLEAR_ALARM, "Temperature for the sensor %s is %d degrees ", sensor_name_p, current_temp); - } -``` #### 3.1.1.2 Development Process -Here is a typical developement process to link eventnotify library to a component and be able to send new events/alarms: +Declare the event-id of new event/alarm along with revision, severity, enable flag and static message in sonic-eventd/etc/evprofile/default.json -a. Update buildimage/rules/*app*.mk +Create a sonic-yang representation of the event as described in the producer framework. - Add $(LIBEVENTNOTIFY_DEV) to compile dependency. +In the source file, the event is published with action as RAISE_ALARM/CLEAR_ALARM (ACK_ALARM/UNACK_ALARM are used by mgmt-framework to allow users to acknowledge/unacknowledge alarms). - Add $(LIBEVENTNOTIFY) to runtime dependency. +The event publish API introduced by the producer framework is: +void event_publish(event_handle_t handle, const std::string &event_tag, + const event_params_t *params=NULL); -``` - Ex: For rules/tam.mk, +For further details of this API, refer to events-producer.md. +The following additional parameters must be given with this API: +1. action - If the application has to raise an alarm, an "action" attribute has to be given with this API call. +2. resource - The resource on which this event is raised, e.g., interface name, IP address, etc. - $(SONIC_TAM)_DEPENDS += $(LIBEVENTNOTIFY_DEV) - $(SONIC_TAM)_RDEPENDS += $(LIBEVENTNOTIFY) -``` +For example, the call for a port down event: +current call: + event_params_t params = {{"ifname",port.m_alias},{"status",isUp ? 
"up" : "down"}}; + event_publish(g_events_handle, "if-state", &params); -b. Update Makefile.am of the app to link to event notify library. -``` - Ex: To let tammgr use event notify API, update src/sonic-tam/tammgr/Makefile.am as below: - - tammgrd_LDADD += -leventnotify -``` -c. Declare the name of new event/alarm along with revision, severity, enable flag and static message in sonic-eventd/etc/evprofile/default.json +new call: + event_params_t params = {{"ifname",port.m_alias},{"status",isUp ? "up" : "down"}, {"resource", port.m_alias}, {"event-id", "INTERFACE_OPER_STATUS_CHANGE"}, {"text", isUp? "status:UP" : "status:DOWN"}}; + event_publish(g_events_handle, "if-state", &params); -d. In the source file where event is to be raised, include eventnotify.h and invoke LOG_EVENT with action as NOTIFY/RAISE_ALARM/CLEAR_ALARM (ACK_ALARM/UNACK_ALARM are used by mgmt-framework to allow users to acknowledge/unacknowledge alarms). - -The event notifier takes the event properties, packs a field value tuple and writes to a table, by name, EVENTPUBSUB. - -The EVENTPUBSUB table uses event-id and a sequence-id generated locally by event notifier as the key so that there wont be any conflicts across multiple applications trying to write to this table. +e.g., Sensor temperature critical high + event_params_t params = {{"event-id", "SENSOR_TEMP_CRTICAL_HIGH"}, {"text", "Current temperature {}C, critical high threshold {}C"}, {"action", "RAISE_ALARM"}, {"resource", "sensor_name"}}; + event_publish(g_events_handle, "sensor_temp_critical_high", &params); ### 3.1.2 Event Consumer -The event consumer is a class in sonic-eventd container that processes the incoming record. +The event consumer is a class in EventDB service that processes the incoming events. On intitialization, event consumer reads */etc/evprofile/default.json* and builds an internal map of events, called *static_event_map*. 
-It then verifies if there was a custom event profile configured and merges its contents to static_event_map built from default event profile. -It then reads from EVENTPUBSUB table. This table contains records that are published by applications and waiting to be read by eventd. -Whenever there is a new record, event consumer reads the record, processes and deletes it. +It then subscribes to zmqproxy for events. -On reading the field value tuple, using the event-id in the record, event consumer fetches static information from *static_event_map*. +On reading the event, using the event-id in the record, event consumer fetches static information from *static_event_map*. As mentioned above, static information contains severity, static message and event enable flag. If the enable flag is set to false, event consumer ignores the event by logging a debug message. If the flag is set to true, it continues to process the event as follows: -- Generate new sequence-id for the event +- Get source, event-id, sequence id and dynamic msg from the published event. - Write the event to Event Table - It verifies if the event corresponds to an alarm - by checking the *action* field. If so, alarm consumer API is invoked for the event for further processing. - If action is RAISE_ALARM, add the record to ALARM table @@ -399,7 +340,7 @@ If the flag is set to true, it continues to process the event as follows: - If action is ACK_ALARM, update *acknowledged* flag of the corresponding raised entry to true in ALARM table and stores timestamp to *acknowledge_time*. - If action is UNACK_ALARM, update *acknowledged* flag of the corresponding raised entry to false in ALARM table and stores timestamp to *acknowledge_time*. - Event and Alarm Statistics tables are updated -- Invoke logging API to send a formatted message to syslog + #### 3.1.2.1 Severity Supported event severities: CRITICAL, MAJOR, MINOR, WARNING and INFORMATIONAL as defined opeconfig alarm yang. 
@@ -407,7 +348,7 @@ The corresponding syslog severities are: log-alert, log-crit, log-error, log-war Severity INFORMATIONAL is not applicable to alarms. #### 3.1.2.2 Sequence-ID -Every new event should have a unique sequential ID. The sequence-id is of the format <32 bit time_t><5 digit running sequence 00000 to 99999>. These semantics allows applications to layout the logs chronologically. +Every new event should have a unique sequential ID. The sequence-id is of the format <32 bit time_t><5 digit running sequence 00000 to 99999>. These semantics allows applications to layout the logs chronologically. Sequence Id is given by the published event. A unique id will be associated to every alarm. #### 3.1.2.3 Revision Every event/alarm defined in the profile must have a revision specified as a numerical. If not given, the default revision '0' is assigned to the event/alarm. This revision is to be incremented if the alarm parameters are updated. @@ -488,72 +429,11 @@ The *MINOR* alarm is also acknowledged by user. ALARM_STATS reads: Major as 0, M The *MAJOR* alarm is also unacknowledged by user. ALARM_STATS reads: Major as 1, Minor as 0. So it is now considered for system LED. System LED becomes Red. ### 3.1.4 Event Receivers -Supported NBIs are: syslog, REST and gNMI. - -#### 3.1.4.1 syslog -Logging API contains logic to take the event record, augment it with any static information, format the message and -send it to syslog. -``` - if (ev_act.empty()) { - const char LOG_FORMAT[] = "[%s], %%%s %s. %s"; - // event Type - // Event Name - // Static Desc - // Dynamic Desc - - // raise a syslog message - syslog(LOG_MAKEPRI(ev_sev, SYSLOG_FACILITY), LOG_FORMAT, - ev_type.c_str(), - ev_id.c_str(), ev_msg.c_str(), ev_static_msg.c_str()); - } else { - const char LOG_FORMAT[] = "[%s] (%s), %%%s %s. 
%s"; - // event Type - // event action - // Event Name - // Static Desc - // Dynamic Desc - // raise a syslog message - syslog(LOG_MAKEPRI(ev_sev, SYSLOG_FACILITY), LOG_FORMAT, - ev_type.c_str(), ev_act.c_str(), - ev_id.c_str(), ev_msg.c_str(), ev_static_msg.c_str()); - } -``` -An example of syslog message generated for an event raised when user selects a custom event profile. -``` -May 19 21:22:07.122786 2021 sonic WARNING eventd#eventd[2419]: [EVENT], %CUSTOM_EVPROFILE_CHANGE : handle_custom_evprofile: Custom Event Profile myprofile.json is applied.. Custom Event Profile is selected by user. -``` -Syslog message for an alarm raised by a sensor: -``` -May 19 21:42:14.373410 2021 sonic ALERT eventd#eventd[2453]: [ALARM] (RAISE), %TEMPERATURE_EXCEEDED : temperatureCrossedThreshold: Current temperature of sensor/2 is 76 degrees. Temperature threshold is 75 degrees. -``` -Syslog message when alarm is cleared is as follows: -``` -May 19 21:46:34.373693 2021 sonic ALERT eventd#eventd[2453]: [ALARM] (CLEAR), %TEMPERATURE_EXCEEDED : temperatureCrossedThreshold: Current temperature of sensor/2 is 70 degrees. Temperature threshold is 75 degrees. -``` -Syslog message when alarm with id=4 is acknowledged is as follows: -``` -May 19 21:48:05.870530 2021 sonic ALERT eventd#eventd[2453]: [ALARM] (ACKNOWLEDGE), Alarm id 4 ACKNOWLEDGE. -``` - -Syslog message when alarm with id=4 is unacknowledged is as follows: -``` -May 19 21:53:24.490545 2021 sonic ALERT eventd#eventd[2453]: [ALARM] (UNACKNOWLEDGE), Alarm id 4 UNACKNOWLEDGE. -``` -Operator can configure specifc syslog host to receive either syslog messages corresponding to events or general log messages. -Through CLI, operator can chose 'logging server [log|event]' command. -When operator configures a host with 'event' type, it receives *only* log messages corresponding to events. -Support for VRF/source-interface/UDP port are all are applicable for 'event' type. 
- -#### 3.1.4.2 REST -Subcribing through REST to receive event notifications is currently being evaluated. - -#### 3.1.4.3 gNMI -gNMI clients can subscribe to receive event notifications. Subscribed gNMI clients receive event fields as in the DB and -there is no customization of these fileds similar to syslog messages. +gNMI clients can subscribe to receive event notifications. TODO: add definitions of protobuf spec -#### 3.1.4.4 System LED +#### 3.1.4.1 System LED The original requirement was to change LED based on severities of the events. But on most of the platforms the system/power/fan LEDs are managed by the BMC. BMC (baseboard management controller) is an embedded system that manages various platform elements like fan, PSU, temperature sensors. There is an API that can be invoked to control LED, but not all platforms will support that API if they are fully controlled by the BMC. @@ -564,104 +444,18 @@ A mechanism must exist for one of these to be master, which, in this case, is pm The proposed solution is to have pmon use ALAMR_STATS counters in conjunction with existing logic to update system LED. -#### 3.1.4.5 Event/Alarm flooding +#### 3.1.4.2 Event/Alarm flooding There are scenarios when system enters a loop of a fault condition that makes application trigger events continuously. To avoid such instances flood the EVENT or ALARM tables, eventd maintains a cache of last event/alarm. Every new event/alarm is compared against this cache entry -to make sure it is not a flood. If it is found to be same event/alarm, the newly raised entry will be silently discarded. +to make sure it is not a flood. If it is found to be same event/alarm, the newly raised entry will be silently discarded. -#### 3.1.4.6 Eventd continuous restart -Under the scenarios when eventd runs into an issue and restarts continuously, applications might keep writing to the eventpubsub table. 
As consumer - eventd - is not able to remove events from the pusbsub table, eventpusbub table could grow forever as applications keep rising events/alarms. -One way to fix is to have the system monitor daemon to periodically (very high polling interval) to check the number of keys in the table and if it exceeds a number, delete all the entries. When system monitor daemon does this, it logs a syslog message. ### 3.1.5 Event Profile The Event profile contains mapping between event-id and severity of the event, enable flag. -Through event profile, operator can change severity of a particular event. And can also enable/disable -a particular event. +Through event profile, operator can set severity and enable/disable an event. The default profile exists at */etc/evprofile/default.json* -By default, every event is enabled. The severity of event is decided by developer while adding the event. -``` -{ - "__README__" : "This is default map of events that eventd uses. Developer can modify this file and send - SIGINT to eventd to make it read and use the updated file. Alternatively developer can test - the new event by adding it to a custom event profile and use 'event profile ' command - to apply that profile without sending SIGINT to eventd. Developer need to commit default.json file - with the new event after testing it out. - Supported severities are: CRITICAL, MAJOR, MINOR, WARNING and INFORMATIONAL. - Supported enable flag values are: true and false.", - "events":[ - { - "name" : "CUSTOM_EVPROFILE_CHANGE", - "revision" : 0, - "severity" : "INFORMATIONAL", - "enable" : "true", - "message" : "Custom Event Profile is applied." - }, - { - "name": "TEMPERATURE_EXCEEDED", - "revision" : 0, - "severity": "CRITICAL", - "enable": "true" - "message" : "Temperature threshold is 75 degrees." - } - ] -} -``` -User can download the default event profile to a remote host. 
User can modify characteristics of -some/all events in the profile and can upload it back to the switch and place the file at /etc/evprofile/. - -The uploaded profile will be called custom event profile. - -An example of custom event profile is as below. -With this particular custom event profile, user wants to -- change severity of CUSTOM_EVPROFILE_CHANGE event (severity changed from INFORMATIONAL to MAJOR) -- suppress the TEMPERATURE_EXCEEDED alarm (enable flag is changed from true to false) -- introduce new alarm by name DUMMY_ALARM (there should be an application to raise/clear this new alarm). -``` -{ - "events": [ - { - "name" : "CUSTOM_EVPROFILE_CHANGE", - "revision" : 0, - "severity" : "MAJOR", - "enable" : "true", - }, - { - "name": "TEMPERATURE_EXCEEDED", - "revision" : 0, - "severity": "CRITICAL", - "enable": "false" - }, - { - "name" : "DUMMY_ALARM", - "revision": 0 - "severity" : "WARNING", - "enable" : "true", - } - ] -} -``` - -User can have multiple custom profiles and can select any of the profiles under /etc/evprofile/ using 'event profile' command. - -The framework will sanity check the user selected profile and merges it map of events *static_event_map* maintained by eventd. - -After a successful sanity check, the framework generates an event indicating that a new profile is in effect. - -If there are any outstanding alarms in the alarm table, the framework removes those records for which enable is set to false in the new profile. -Severity counters in ALARM_STATS are reduced accordingly. - -Eventd starts using the merged map of characteristics for the all the newly generated events. A CUSTOM_EVPROFILE_CHANGE event is generated. - -The event profile is upgrade and downgrade compatible by accepting only those attributes that are *known* to eventd. -All the other attributes will remain to their default values. - -Sanity check rejects the profile if attributes contains values that are not known to eventd. 
- Config Migration hooks will be used to persist the current active profile across an upgrade.
- The profile can also be applied through ztp.
 ### 3.1.6 CLI
 The show CLI require many filters with range specifiers.
@@ -706,7 +500,6 @@ When either of the limit is reached, the framework wraps around the table by dis
 User can send SIGINT to eventd process to force read and apply the manifest limits.
-The EVENTPUBSUB table will be periodically monitored and flushed based of a pre-defined table limit. Based on discussions this can be plugged into existing system jobs.
 An example of an event in EVENT table.
 ```
@@ -726,20 +519,24 @@ revision : Revision of the event {uint64}
 127.0.0.1:6379[6]> hgetall "EVENT|1"
- 1) "text"
- 2) "handle_custom_evprofile: Custom Event Profile x.json is applied."
- 3) "type-id"
- 4) "CUSTOM_EVPROFILE_CHANGE"
- 5) "id"
- 6) "1"
- 7) "time-created"
- 8) "1621459327118629520"
- 9) "resource"
-10) "/etc/evprofile/x.json"
+1) "time-created"
+2) "1696888244771929600"
+3) "type-id"
+4) "PSU_POWER_STATUS"
+5) "text"
+6) "PSU 1 is out of power."
+7) "action"
+8) "RAISE"
+9) "resource"
+10) "PSU 1"
 11) "severity"
 12) "WARNING"
-13) "Revision"
-14)"0"
+13) "id"
+14) "21"
+15) "acknowledged"
+16) "false"
+17) "revision"
+18) "0"
 ```
 Schema for EVENT_STATS table is as follows:
@@ -845,39 +642,43 @@ The following filters are supported:
 - Records between two timestamps, one timestamp and end, and beginning and a timestamp.
 - All records between two Sequence Numbers (incl end and begin)
-### 3.1.9 Supporting third party containers
-To support third party components ( e.g. FRR, teamd, DHCP Relay, LLDPd, ntpd etc ) which can not be modified to raise events, the following options are considered
-and are being evaluated.
-1. Patch the components
-   Create a patch for these components by adding libeventnotify library and invoke the API. This however, requires these patches need to be maintained in the code forever.
-
-2. Listen to syslog messages
-   As many of these components raises syslog messages on an important event, a listener can be implemented to read incoming syslog messages and raise
-   events based on the message.
-   This however is heavy on performance due to the fact that listener has to parse each syslog message. Also listener need to maintain a map of messages to
-   event-id and need to be aware of resource and other specific details. It need to be aware of nuances of alarm raising/clearing if the component follows
-   any specific logic.
-
-Approach 1 is preferred.
 ## 3.2 DB Changes
 ### 3.2.1 EVENT DB
-A new instance, redis4, is created and EVENT DB uses the new instance.
-The following tables uses Event DB.
-Table EVENTPUBSUB is used for applications to write events and for eventd to access and process them.
 Event Table (EVENT) and Alarm Table (ALARM) are used to house events and alarms respectively.
 To maintain various statistics of events, these two tables are used : EVENT_STATS and ALARM_STATS.
-EVPROFILE table is used by mgmt-framework to communicate name of the custom event profile when configured through NBI.
-Eventd reads the file name from this table and merges it with its static_event_map.
 ## 3.3 User Interface
 ### 3.3.1 Data Models
 The following is SONiC yang for events.
 ```
-module: sonic-event
-  +--rw sonic-event
+
+Update existing sonic-events-comm.yang
+Add attributes type-id and action.
+    grouping sonic-events-cmn {
+        ....
+        ....
+        leaf type-id {
+            type union {
+                type string;
+                type identityref {
+                    base SONIC_EVENT_TYPE_ID;
+                }
+            }
+            description
+                "The abbreviated name of the alarm";
+        }
+        leaf action {
+            type event-action;
+            description
+                "The action performed on the event";
+        }
+    }
+
+module: sonic-event-history
+  +--rw sonic-event-history
      +--rw EVENT
      |  +--rw EVENT_LIST* [id]
      |     +--rw id    uint64
@@ -897,7 +698,7 @@ module: sonic-event
            +--rw cleared?   uint64
   rpcs:
-    +---x show-events
+    +---x show-event-history
        +---w input
        |  +---w (option)?
        |     +--:(time)
@@ -996,23 +797,7 @@ module: sonic-alarm
          +--ro acknowledge-time?   event:timeticks64
 ```
-Following is for sonic yang to support event profiles.
-```
-module: sonic-evprofile
-
-  rpcs:
-    +---x get-evprofile
-    |  +--ro output
-    |     +--ro file-name?   string
-    |     +--ro file-list*   string
-    +---x set-evprofile
-       +---w input
-       |  +---w file-name?   string
-       +--ro output
-          +--ro status?   string
-```
-
-openconfig alarms yang is defined at [here](https://github.com/openconfig/public/blob/master/release/models/system/openconfig-alarms.yang)
+openconfig alarms yang is defined [here](https://github.com/openconfig/public/blob/master/release/models/system/openconfig-alarms.yang)
 ### 3.3.2 CLI
 #### 3.3.2.1 Exec Commands
@@ -1033,11 +818,6 @@ Un-acknowledging an alarm updates alarm statistics and thereby applications like
 The alarm record in the ALARM table is marked with acknowledged field set to false. There is acknowledge-time field that indicates when that alarm is un-acknowledged.
-```
-sonic# event profile
-```
-The command takes name of specified file, validates it for its syntax and values; merges it with its internal static map of events *static_event_map*.
-
 ```
 sonic# clear event
 ```
@@ -1045,76 +825,60 @@ This command clears all the records in the event table. All the event stats are
 The command will not affect alarm table or alarm statistics.
 Eventd generates an event informing that event table is cleared.
-#### 3.3.2.2 Configuration Commands
-```
-sonic(config)# logging server [log|event]
-```
-Note: The 'logging server' command is an existing, already supported command.
-It is only enhanced to take either 'log' or 'event' to indicate either native syslog messages or syslog messages corresponding to events alone are sent to the remote host.
-Support with VRF/source-interface and configuring remote-port are all backward comaptible and will be applicable to either 'log' or 'event' options.
-
 #### 3.3.2.3 Show Commands
 ```
-sonic# show event profile
---------------------------
-Active Event Profile
---------------------------
-myProfile.json
---------------------------
-Available Event Profiles
---------------------------
-default.json
-myProfile.json
-userProfile.json
-
 sonic# show event [ details | summary | severity | start end | recent <5min|60min|24hr> | id | from to ]
 'show event' commands would display all the records in EVENT table.
 sonic# show event
-----------------------------------------------------------------------------------------------------------------------------
-Id   Action         Severity  Name                     Timestamp                 Description
-----------------------------------------------------------------------------------------------------------------------------
-1    -              WARNING   CUSTOM_EVPROFILE_CHANGE  2021-05-19T21:38:27.455Z  handle_custom_evprofile: Custom Event Profile x.json is applied.
-2    RAISE          CRITICAL  DUMMY_ALARM              2021-05-19T21:39:31.622Z  signalHandler: Raising simulated alarm
-3    CLEAR          CRITICAL  DUMMY_ALARM              2021-05-19T21:42:34.371Z  signalHandler: Clearing simulated alarm
-4    RAISE          CRITICAL  DUMMY_ALARM              2021-05-19T21:46:14.371Z  signalHandler: Raising simulated alarm
-5    ACKNOWLEDGE    CRITICAL  DUMMY_ALARM              2021-05-19T21:48:05.845Z  Alarm id 4 ACKNOWLEDGE.
-6    UNACKNOWLEDGE  CRITICAL  DUMMY_ALARM              2021-05-19T21:53:24.484Z  Alarm id 4 UNACKNOWLEDGE.
-7    CLEAR          CRITICAL  DUMMY_ALARM              2021-05-19T21:55:54.977Z  signalHandler: Clearing simulated alarm
+
+----------------------------------------------------------------------------------------------------
+Id   Action         Severity       Name              Timestamp
+----------------------------------------------------------------------------------------------------
+1    RAISE          WARNING        PSU_POWER_STATUS  2023-10-09T21:50:44.771Z
+2    -              INFORMATIONAL  SYSTEM_STATUS     2023-10-09T21:51:02.784Z
+3    RAISE          CRITICAL       DUMMY_ALARM       2023-15-19T21:39:31.622Z
+4    CLEAR          CRITICAL       DUMMY_ALARM       2023-15-19T21:42:34.371Z
+5    RAISE          CRITICAL       DUMMY_ALARM       2023-15-19T21:46:14.371Z
+6    ACKNOWLEDGE    CRITICAL       DUMMY_ALARM       2023-15-19T21:48:05.845Z
+7    UNACKNOWLEDGE  CRITICAL       DUMMY_ALARM       2023-15-19T21:53:24.484Z
+8    CLEAR          CRITICAL       DUMMY_ALARM       2023-15-19T21:55:54.977Z
+
 sonic# show event details
-----------------------------------------------
-Event Details - 1
+
+---------------------------------------------
+Event Details - 1
 ----------------------------------------------
 Id: 1
-Revision: 0
-Action: -
+Action: RAISE
 Severity: WARNING
-Type: CUSTOM_EVPROFILE_CHANGE
-Timestamp 2021-05-19T21:38:27.455Z
-Description: handle_custom_evprofile: Custom Event Profile x.json is applied.
-Source: /etc/evprofile/x.json
-
+Revision: 0
+Type: PSU_POWER_STATUS
+Timestamp: 2023-10-09T21:50:44.771Z
+Description: PSU 1 is out of power.
+Source: PSU 1
+
 ----------------------------------------------
-Event Details - 2
+Event Details - 2
 ----------------------------------------------
-Id: 2
-Revision: 1
-Action: RAISE
-Severity: CRITICAL
-Type: DUMMY_ALARM
-Timestamp 2021-05-19T21:39:31.622Z
-Description: signalHandler: Raising simulated alarm
-Source: simulation
+Id: 2
+Action: -
+Severity: INFORMATIONAL
+Revision: 0
+Type: SYSTEM_STATUS
+Timestamp: 2023-10-09T21:51:02.784Z
+Description: System is ready
+Source: system_status
 ----------------------------------------------
 Event Details - 3
 ----------------------------------------------
 Id: 3
-Revision: 0
 Action: CLEAR
 Severity: CRITICAL
+Revision: 1
 Type: DUMMY_ALARM
 Timestamp 2021-05-19T21:42:34.371Z
 Description: signalHandler: Clearing simulated alarm
@@ -1237,52 +1001,60 @@ Acnowledged: 2
 ### 3.3.3 REST API Support
 sonic REST links:
-* /restconf/data/sonic-event:sonic-event/EVENT/EVENT_LIST
-* /restconf/data/sonic-event:sonic-event/EVENT_STATS/EVENT_STATS_LIST
+* /restconf/data/sonic-event:sonic-event-history/EVENT/EVENT_LIST
+* /restconf/data/sonic-event:sonic-event-history/EVENT_STATS/EVENT_STATS_LIST
 * /restconf/data/sonic-alarm:sonic-alarm/ALARM/ALARM_LIST
-* /restconf/data/sonic-alarm:sonic-alarm/ALARM_STATS/ALARM_STATS_LIST
-* /restconf/operations/sonic-evprofile:get-evprofile
-* /restconf/operations/sonic-evprofile:set-evprofile
-* /restconf/operations/sonic-alarm:acknowledge-alarms
 * /restconf/operations/sonic-alarm:unacknowledge-alarms
 openconfig REST links:
-* /restconf/data/openconfig-system:system/openconfig-events:events
-* /restconf/data/openconfig-system:system/openconfig-events:event-stats
+* /restconf/data/openconfig-system:system/openconfig-event-history:events
+* /restconf/data/openconfig-system:system/openconfig-event-history:event-stats
 * /restconf/data/openconfig-system:system/alarms
 * /restconf/data/openconfig-system:system/openconfig-alarms-ext:alarm-stats
-# 4 Flow Diagrams
-![Sequence Diagram](event-alarm-framework-seqdiag.png)
-# 5 Warm Boot Support
-## 5.1 Application warm boot
-Applications confirming to the warm boot, should have stored their state and compare current values against previous values.
+# 4 Persistence
+Alarms and events are stored in the ALARM and EVENT tables in a separate Redis DB instance called EventDB.
+This instance is configured to periodically persist the EventDB to disk.
+It is configured to persist after 75 Redis DB events in 180 seconds, which is equal to ~5-6 SONiC events.
+
+
+## 4.1 Warm reboot
+### 4.1.1 Application restart
+Applications conforming to restart should store their state and compare current values against previous values.
 Such compliant application also "remembers" that it raised an event before for a specific condition.
 They would
-* not raise alarms/events for the same condition that it raised pre warm boot
+* not raise alarms/events for the same condition that it raised before the restart.
 * clear those alarms once current state of a particular condition is recovered (by comparing against the stored state).
-## 5.2 eventd warm boot
-Records from applications are stored in a table, called EVENTPUBSUB.
-Records that are being written will be queued when the consumer (eventd) is down.
+### 4.1.2 System warm reboot
+On system warm reboot, the current EventDB is persisted to disk without the ALARM and ALARM_STATS tables. Applications should check the condition after restart and raise the alarm if the condition exists.
+Only the EVENT table is persisted to disk across system warm reboot. This overwrites the DB file from periodic persistence.
+
+
+## 4.2 Fast reboot
+On system fast reboot, the current EVENT and EVENT_STATS tables from EventDB are persisted to disk. The ALARM and ALARM_STATS tables are not persisted. Applications have to raise the alarm on restart if the condition exists.
+The EventDB is stored to disk prior to the shutdown of control plane protocol services, so as not to impact fast-boot times. This overwrites the DB from periodic persistence.
+
+## 4.3 Cold reboot
+The current EVENT and EVENT_STATS tables are persisted to disk across cold boot. The ALARM and ALARM_STATS tables are not persisted, and applications have to raise the alarm on restart if the condition exists. This overwrites the DB from periodic persistence.
-During normal operation, eventd reads, processes whenever a new record is added to the table.
+## 4.4 Power reset
+On power reset, the EventDB is loaded from the DB on disk, which comes from periodic persistence. The ALARM and ALARM_STATS tables are removed from the loaded DB.
+Applications have to raise the alarm on restart if the condition exists. In this case, events from the previous boot can be missing, as the reset may have happened within the periodic persistence timer interval.
-When eventd is restarted, events and alarms raised by applications will be waiting in a queue while eventd is coming up.
-When eventd eventually comes back up, it reads those records in the queue.
-# 6 Scalability
+# 5 Scalability
 In this feature, scalability applies to Event Table (EVENT).
 As it is persistent and it records every event generated on the system, to protect against it growing indefinitely, user can limit its size through a manifest file.
 By default, the size of Event Table is set to 40k events or events for 30 days - after which, older records are discarded to make way for new records.
-# 7 Showtech support
+# 6 Showtech support
 The techsupport bundle is upgraded to include output of "show event recent 60min” and “show alarm all”.
 The first command displays all the events that were sent by applications for the last one hour.
 The second command displays all the alarms that are waiting to be cleared by applications (this includes alarms that were acknowledged by operator as well).
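The Event Table limits described under Scalability (40k events or 30 days, whichever is reached first) can be pictured with the following minimal sketch. It is illustrative only and is not eventd code: the function name, the dictionary layout, and the assumption that `time-created` holds nanoseconds since the epoch and that ids increase with time are all hypothetical.

```python
# Illustrative sketch only: names and data layout are assumptions, not eventd code.
NS_PER_DAY = 24 * 60 * 60 * 1_000_000_000  # time-created is assumed to be in nanoseconds


def select_records_to_discard(events, now_ns, max_events=40_000, max_age_days=30):
    """Return the ids of EVENT records to discard, in ascending id order.

    events maps a sequence id (int) to its time-created value (ns since epoch).
    A record is discarded if it is older than max_age_days, or if it is among
    the oldest remaining records once more than max_events are left.
    """
    cutoff_ns = now_ns - max_age_days * NS_PER_DAY
    ordered_ids = sorted(events)  # ids are assumed to increase with time
    expired = [eid for eid in ordered_ids if events[eid] < cutoff_ns]
    remaining = [eid for eid in ordered_ids if events[eid] >= cutoff_ns]
    overflow = len(remaining) - max_events
    if overflow > 0:
        # Too many records left: drop the oldest ones to make room.
        expired.extend(remaining[:overflow])
    return sorted(expired)


now = 1_696_888_244_771_929_600
sample = {1: now - 31 * NS_PER_DAY, 2: now - NS_PER_DAY, 3: now - 1, 4: now}
print(select_records_to_discard(sample, now))                # [1]
print(select_records_to_discard(sample, now, max_events=2))  # [1, 2]
```

With the default limits only the 31-day-old record is discarded; lowering the count limit to 2 also discards the oldest remaining record, which mirrors the wrap-around check listed in the unit tests.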
-# 8 Unit Test
+# 7 Unit Test
 - Raise an event and verify the fields in EVENT table and EVENT_STATS table
 - Raise an alarm and verify the fields in ALARM table and ALARM_STATS table
 - Clear an alarm and verify that record is removed from ALARM and ALARM_STATS tables are udpated
@@ -1291,10 +1063,5 @@ The second command displays all the alarms that are waiting to be cleared by app
 - Verify wrap around for EVENT table ( change manifest file to a lower range and trigger that many events )
 - Verify sequence-id for events is persistent by restarting
 - Verify counters by raising various alarms with different severities
-- Change severity of an event through custom event profile and verify it is logged at specified severity
-- Change enable/disable of an event through custom event profile and verify it is suppressed
-- Verify custom event profile with an invalid severity is rejected
-- Verify custom event profile with an invalid enable/disable flag is rejected
-- Verify custom event profile is persisted after a reboot
 - Verify various show commands
 - Verify 'logging-server event' command forwards only event log messages to the host
diff --git a/doc/event-alarm-framework/eventd-block.png b/doc/event-alarm-framework/eventd-block.png
new file mode 100644
index 0000000000..c4e5bcf8bd
Binary files /dev/null and b/doc/event-alarm-framework/eventd-block.png differ