Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Event and Alarm Framework HLD #761

Closed
wants to merge 48 commits into from
Closed

Conversation

spenugondaa
Copy link

@spenugondaa spenugondaa commented Mar 16, 2021

Event and Alarm framework enables applications to raise notifications in response to a condition that may impact the health of the system that operator may be interested in.

  1. Define an API and allow applications to raise events
  2. Store the events in the DB
  3. Through CLI, REST and gNMI, user can fetch contents of the DB tables.
  4. Message corresponding to an event is sent to remote syslog host
  5. Allows operator to customize severities of events and enable/disable events through event profiles.
  6. Through management framework, gNMI can subscribe to receive updates as events are raised.

Corresponding Coding PRs:
Framework
Schema for Tables
mgmt-framework changes
mgmt-common changes

Signed-off-by: Srinadh Penugonda [email protected]

spenugondaa and others added 23 commits March 16, 2021 10:31
Allows applications to raise events; Stores them in the DB; allows NBIs to subscribe and fetch to monitor the device.

Signed-off-by: Srinadh Penugonda <[email protected]>
Signed-off-by: Srinadh Penugonda <[email protected]>
Signed-off-by: spenugondaa <[email protected]>
Signed-off-by: spenugondaa <[email protected]>
Signed-off-by: spenugondaa <[email protected]>
doc/event-alarm-framework/event-alarm-framework.md Outdated Show resolved Hide resolved
doc/event-alarm-framework/event-alarm-framework.md Outdated Show resolved Hide resolved
doc/event-alarm-framework/event-alarm-framework.md Outdated Show resolved Hide resolved
doc/event-alarm-framework/event-alarm-framework.md Outdated Show resolved Hide resolved
doc/event-alarm-framework/event-alarm-framework.md Outdated Show resolved Hide resolved
doc/event-alarm-framework/event-alarm-framework.md Outdated Show resolved Hide resolved
doc/event-alarm-framework/event-alarm-framework.md Outdated Show resolved Hide resolved
doc/event-alarm-framework/event-alarm-framework.md Outdated Show resolved Hide resolved
@renukamanavalan
Copy link
Contributor

Here some review comments.
[Sorry about joining late in the game]

1) Defaults.json can be part of image, as that helps adding new requirements, transparently. Adding new events is 
clean/easy, but re-defining/removing existing, need to follow some restrictions
    a. Each event in profile has two IDs
        i. Name (e.g. "TEMPERATURE_EXCEEDED")
        ii. ID - A numeric ID which is globally unique across time & space

    b. An image upgrade could like to redefine an event.
    For example, it may like to set temp limit for TEMPERATURE_EXCEEDED to 65 from original 75.
    If so, it updates its default.json, keeps the same name (e.g. TEMPERATURE_EXCEEDED) but updates 
    its numeric ID.

    Most of the consumers/clients are interested in by name only and often indifferent to the re-definition.
    This helps update/re-define of an event to be transparent to most. The few who might be interested,
    could use the numeric ID to distinguish, when messages flow from different sources running different
    releases that adapt different definitions.

    c. The image may delete some old events that are not applicable anymore and add new ones.
        The numeric ID of the deleted ones are never re-used. Hence leave the entries in defauts.json as deprecated.
        This is required as clients may still get these messages from older releases and they might use numeric
         ID for distinction.
        i. Along with enabled/disabled, have additional one as deprecated in Event profile definition
        ii. This would be useful when merging user-custom-profile as any user update for deprecated
            entries will be skipped/ignored.

2) Image upgrade will only carry forward currently configured custom profile -- not all profiles in /etc/evprofile

3)  Add "DEBUG" as one of supported severities.

4) There is a small grey area. The same event, could get raised as alarm or not, as that gets distinguished
    only by non-empty/empty action field. There is no reason/justification for an event to play dual role as
    alarm or not. But an app could fail to send action due to code bug, which can degrade an alarm to an
    event unintentionally.

Q: Can we classify an event as alarm or not in the event profile ?
     This removes the greyness.
     We can fail the event of type alarm w/o action

5) Dynamic message definition
     Often these messages are parsed by tools for required info, like i/f name in an i/f-down message.
     Across different releases the publisher-client could potentially re-phrase the message that can totally fail
     the clients that parse.
     
     So provide some schema, that is flexible yet stable w/o parsing cost.
    a.   A J2 template for human readable message, with named parameters
    b.   This template is part of event profile definition
    c.   Event publisher, only provides the named parameters & values as a JSON string.
    d.   Consumers could just take the named parameters along with Event name & ID and they get the info they need.
    e.   Consumers who need human readable message could use the J2 template to format the string
    f.   The rsyslog will get the formatted string

Pro:
    a.  The J2 template can be updated overtime and it would not impact the consumers.
    b.  The parameters can be added /removed across image releases and consumers could get what 
         they look for from the JSON string. Upgraded consumers would look for new ones and old ones
         will work with previous values and may find some missing.
    c. As publishing apps, only send params & values, the size of the message they send to syslog become smaller.

6) SEQUENCE_ID:
    Instead of 1 to 2^64, we could provide some semantic value to it as <32 bit time_t><5 digit running
    sequence 00000 to 99999>. Being 5 digits, we could use it globally across all events. This avoids the need
    for caching last used value per event. Coin as <time_t><next ID> for all events.

Pro:
    a.  Subscribing clients, when they restart, all they need is the timestamp from last run. With that info,
         they can set the start filter for event-id for query/subscribe.
    b.  Tools that collects events from multiple sources for combined analysis, sorting by Event-ID
          would help layout the logs chronologically.
    For an sample, while debugging two devices, we can see their events as ordered.

7) Manifest file for setting limits on table sizes, could cover all tables, including PUB/SUB.

8) Providing this feature with least code changes.
    Even for non III-party containers, like orchagent, syncd, <85> the adoption would be slow.
    To get over, one more option would be to use syslog filters
    Let the rsyslog parses the message
    Use omfwd to send out filtered messages
    The listener we add will create the events
    Here the listener still needs to parse to get some info out, but it will be limited to only those that are valid events.
  

@renukamanavalan
Copy link
Contributor

@spenugondaa can you please provide your response to my review comments?

@renukamanavalan
Copy link
Contributor

Look at this: SD-Data in rsyslog

This provides structured data & as well human readable format.

@yxieca yxieca force-pushed the master branch 2 times, most recently from 8498931 to 8837dc2 Compare April 15, 2022 16:51
1. Introduce revision id for an event.
2. Only active profile to be persisted on upgrade.
3. SequenceId to have some sematics <time_t><id>
@bhaveshdell
Copy link
Contributor

@renukamanavalan Comments addressed. Please review and approve changes.

@bhaveshdell
Copy link
Contributor

@yozhao101 Comments addressed. Please review and approve changes.

@renukamanavalan
Copy link
Contributor

@qiluo-msft @zhenggen-xu
Can you please review and provide your approval?

@renukamanavalan
Copy link
Contributor

@qiluo-msft @zhenggen-xu
Can you please review and provide your approval?

@jeff-yin
Copy link

@renukamanavalan @zhangyanzhao -- any way to ping @qiluo-msft @zhenggen-xu internally to review/approve/merge?

@renukamanavalan
Copy link
Contributor

Please ping me at @[email protected]

@bhavini-gada
Copy link

@qiluo-msft - Can you please help merge this HLD ?

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Aug 8, 2022

CLA Missing ID CLA Not Signed

@zhangyanzhao
Copy link
Collaborator

close this PR and use #1064 to continue to track.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants