Create Kernel_Self_Monitoring.md #44 (open, 1 commit into base: main)

New file: `Contributions/Kernel_Self_Monitoring.md` (+117 lines)
# **Kernel Self Monitoring**

## Index
[References](#references)

[Purpose of the Document](#purpose-of-the-document)

[Structure of the Document](#structure-of-the-document)

[Introduction - Linux Kernel: practical safety problems](#introduction---linux-kernel-practical-safety-problems)

[The concept - Safety-oriented self monitoring](#the-concept---safety-oriented-self-monitoring)

[Constraints for a safety-oriented kernel self monitor](#constraints-for-a-safety-oriented-kernel-self-monitor)

[Targets for kernel self monitoring activity](#targets-for-kernel-self-monitoring-activity)


## **References**

## **Purpose of the Document**
These are guidelines for safety-hardening certain aspects of the Linux kernel through self monitoring.

They can act as a touchstone for evaluating candidate designs and avoiding known pitfalls.

## **Structure of the Document**
- The document presents a brief overview of the safety problems that can arise when using Linux, and how they can be mitigated through self-monitoring.
- The next sections discuss in more detail the typical pitfalls that must be avoided for the monitoring to be useful, and the goals that should be targeted.

> **Review comment (Collaborator):** remove ", instead,"


## **Introduction - Linux Kernel: practical safety problems**
The Linux Kernel has many desirable features that make it a prime candidate for a wide range of use cases.

However, its lack of both an intrinsic safety design and safety-oriented features makes it challenging to use as-is for safety applications.

Therefore, a Linux-based system risks experiencing undetected internal interference, which might compromise the set of safety goals the system must support.

It is a customary mitigation to introduce system-level monitoring, in various forms.

To mention a couple of examples:
- Watchdogs, to detect abnormal delays either in processing events or performing actions.
- Output vetting, to detect abnormal actuator control signals produced by the system.

These external safety measures, though, are not always sufficient: they might have limited ability to observe the internal status of the kernel, or it might be desirable to exert tighter control over the evolution of its internal states, especially in scenarios where the kernel is used to enable and manage complex safety workloads.

> **Review comment (Contributor):** In the statement above I would add "especially in scenarios where the Kernel is used to enable and manage complex safety workloads".

This is where the concept of kernel self monitoring comes into the picture.


## **The concept - Safety-oriented self monitoring**

> **Review comment (Collaborator):** I do not see the title as fit for the content. The content could be a continuation of the previous section, as you are still introducing the subject. With this title I would expect you to explain the concept, which is not the case.
> **Review comment (Contributor):** While I agree on the technical content, I think that in this doc we are missing the theoretical principles, which are in fact quite simple:
>
> 1. The self monitoring must be developed or qualified with a systematic capability level equal to or higher than the ASIL or SIL level associated with the safety claim that it supports.
> 2. The scope of the monitored part of the kernel and the scope of the self monitoring part shall be clearly identified, including any part of the design or code that is common between the two.
> 3. A safety analysis of the monitored code shall define the dangerous failure modes originating from it that would violate the target safety claim.
> 4. Assuming no interference from the monitored code or from any other code running in the kernel, the safety monitor shall be verified to be effective against a subset of the dangerous failure modes mentioned in 3). Any residual dangerous failure mode shall be covered by additional mitigations (out of the scope of this document).
> 5. A comprehensive analysis of the possible interference failure modes originating from the monitored code, or from any other code running in the kernel, shall be carried out. With respect to such analysis, the dangerous interference failure modes (those leading to a violation of the target safety claim) shall be identified and mitigated through adequate measures.
>    For example, adequate measures could range from Assumptions of Use or proper configurations that avoid such failure modes, to designing and implementing additional detection mechanisms, up to qualifying the code generating the interference to a level of systematic capability equal to or higher than the ASIL or SIL allocated to the target safety claim.
>
> Note: when designing the safety monitors and any additional mitigation measures, availability, security and performance requirements shall also be considered. E.g. a measure making the overall safety function unavailable is perfectly safe, but it would be useless.

The previous section highlights the need for a reliable way of allocating at least some safety goals to the Linux Kernel itself, even if not the entire spectrum of system safety requirements.

The following considerations are not specific to Linux, but are nonetheless relevant:
- Safety requirements are "contagious".
  Whatever mechanism, or combination of mechanisms, might be chosen to support safety goals, it must be proven exempt from the very form of interference that it is supposed to help detect/deflect.
  At the very least, it must be used in such a way that said interference cannot compromise its detection/deflection activity.
- From a safety perspective, not all types of interference are equal.
  For example, assuming that it is acceptable to have interruptions of availability, as long as unsafe behaviour is avoided, interference can be classified as:
  - "Acceptable" interference:
    - Interference causing misbehaviour that can be detected/deflected through external mechanisms (as described in the previous section).
    - Interference that does not propagate to safety-relevant actuators.
  - "Non-acceptable" interference:
    - Interference that causes an undetected interruption or degradation in the upholding of safety goals.

      E.g. a passive safety feature like an airbag goes silently offline.
    - Interference that causes outright dangerous malfunctions.

      E.g. a car steers itself into the wrong lane.
- Different levels of safety hardening are possible, and similar results can be reached through different measures; design-time tradeoffs and constraints play a major role in deciding how to approach the problem.

  E.g.:

  - A user-space process can send messages to a smart actuator, with built-in hardening, so that at least corruption introduced by the kernel can be detected (see the sketch after this list):
    - Checksumming the message provides a degree of resilience against corruption.
    - Adding a timestamp / progressive message ID, and requiring communication to happen with a minimum periodicity, can help detect messages being discarded or delivered late/out of order.

    However, this design requires an ad-hoc user-mode device driver, and a peripheral that can handle the more complex communication protocol.
  - Should the previous point not be feasible, for any reason, the question remains of how this could be achieved by enhancing the kernel.
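
To make the smart-actuator example concrete, here is a minimal C sketch of such a hardened message. The layout and helper names (`hardened_msg`, `msg_seal`, `msg_check`) are illustrative assumptions for this document, not an existing protocol:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical hardened message layout; all names are illustrative. */
struct hardened_msg {
	uint32_t seq;         /* progressive ID: detects loss/reordering */
	uint64_t timestamp;   /* ns timestamp: detects late delivery     */
	uint8_t  payload[32]; /* actuator command                        */
	uint32_t crc;         /* covers all preceding fields             */
};

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32_calc(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int b = 0; b < 8; b++)
			crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
	}
	return ~crc;
}

/* Sender side: stamp and seal the message before it crosses the kernel. */
static void msg_seal(struct hardened_msg *m, uint32_t seq,
		     const uint8_t *cmd, size_t cmd_len)
{
	struct timespec ts;

	memset(m, 0, sizeof(*m));
	m->seq = seq;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	m->timestamp = (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
	if (cmd_len > sizeof(m->payload))
		cmd_len = sizeof(m->payload);
	memcpy(m->payload, cmd, cmd_len);
	m->crc = crc32_calc((const uint8_t *)m, offsetof(struct hardened_msg, crc));
}

/* Receiver (smart actuator) side: reject corrupted, out-of-order or
 * stale messages; max_age_ns encodes the required periodicity. */
static int msg_check(const struct hardened_msg *m, uint32_t expected_seq,
		     uint64_t now_ns, uint64_t max_age_ns)
{
	if (m->crc != crc32_calc((const uint8_t *)m, offsetof(struct hardened_msg, crc)))
		return -1; /* corrupted in transit */
	if (m->seq != expected_seq)
		return -2; /* dropped, duplicated or reordered */
	if (now_ns - m->timestamp > max_age_ns)
		return -3; /* delivered too late */
	return 0;
}
```

Note that the kernel is treated purely as an untrusted transport here: both endpoints can detect its interference without any cooperation from the kernel itself.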

Pulling together what has been discussed so far, it is possible to formulate a set of requirements for whatever component might step up to provide equivalent safety capability. That is what the next section discusses.

## **Constraints for a safety-oriented kernel self monitor**
As mentioned earlier, the self monitor needs to meet certain requirements, which are formulated from the perspective of hardening it against the same interference it is supposed to detect.
- The self monitor itself can incur failures:
  - due to interference from some other component;
  - due to its own bugs.

  Such events must be detectable, e.g. by an additional, external component.

  Typically, it could be a watchdog of some kind:
  - an external HW watchdog;
  - an external core, running safety-qualified code (e.g. a safety island);
  - an external execution context running on the same system (e.g. a hypervisor).
- The failure analysis must consider that the meta-data used by the self-monitor is exposed to interference as well:

> **Review comment (Collaborator):** I understand that you mean the kernel meta-data here. Perhaps explicitly state that the meta-data to be monitored consists of the kernel's internal data structures, as opposed to userspace data? I kind of miss the information of what the self monitor is supposed to monitor. Can we place a requirement on that?

  - Mistakenly identifying a failure in the metadata as safety-relevant, when it is not, lowers availability, but is considered safe.
  - Failing to recognise a safety-relevant failure, on the other hand, is not acceptable.
- Whatever mechanisms the self monitor might employ, they must be designed in a way that:
  - does not fail silently: it is acceptable that they might fail, when exposed to the same type of interference, but this must happen:
    - in a noticeable way;
    - within the maximum time allowed for detecting failures;
  - does not introduce hazards of its own.

The safety requirements for the safety monitor itself can be summarised as follows (a sketch illustrating the hardened-metadata point comes after this list):
- use hardened metadata in some form, e.g.:
  - use redundancy for non-write-protected data;
  - write-protect its data, to avoid silent corruption;
- avoid depending on the safe behaviour of other kernel subsystems and features, e.g.:
  - semaphores, the scheduler, memory allocators;
- avoid altering the system, e.g. through patching its code and/or data.
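
As an illustration of the hardened-metadata requirement, combined with the external-watchdog pairing described earlier, here is a minimal C sketch. The inverted-copy scheme and all names (`hardened_u32`, `monitor_cycle`, `external_watchdog_kick`) are assumptions made for this example, not an existing kernel interface:

```c
#include <stdbool.h>
#include <stdint.h>

/* Monitor metadata stored twice, the second copy bit-inverted, so that
 * silent corruption of either copy is detectable without relying on any
 * other kernel subsystem. */
struct hardened_u32 {
	uint32_t value;
	uint32_t value_inv; /* must always equal ~value */
};

static void hardened_u32_set(struct hardened_u32 *h, uint32_t v)
{
	h->value = v;
	h->value_inv = ~v;
}

static bool hardened_u32_get(const struct hardened_u32 *h, uint32_t *out)
{
	if (h->value != ~h->value_inv)
		return false; /* one copy was corrupted: fail loudly */
	*out = h->value;
	return true;
}

/* Assumed to exist: kicks an external HW watchdog, a safety island, or a
 * hypervisor-provided watchdog context. */
extern void external_watchdog_kick(void);

/* Monitor cycle: the watchdog is kicked only after the self-checks pass,
 * so a failure of the monitor or of its metadata surfaces as a watchdog
 * timeout within the maximum detection time. */
static void monitor_cycle(struct hardened_u32 *heartbeat)
{
	uint32_t count;

	if (!hardened_u32_get(heartbeat, &count))
		return; /* corrupted metadata: deliberately skip the kick */
	hardened_u32_set(heartbeat, count + 1);
	external_watchdog_kick();
}
```

Not kicking the watchdog on a failed check is the design choice that makes the monitor fail noticeably rather than silently.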

## **Targets for kernel self monitoring activity**
As discussed, external means can be employed to monitor the system for loss of functionality; however, that only addresses one aspect of the misbehaviour of a black-box system.

The other important aspect is the emergence of the effects of internal interference at the interface between the system and the rest of the world; in this case, the interfaces used to control actuators.
One can argue that, as long as external monitors check for loss of availability, internal self-interference is not a threat, provided it does not manifest as unsafe controls for actuators.

The simplest form of self monitoring thus consists in sanitising output values before they reach an actuator and result in unsafe actions. Of course, should the actuator be capable of vetting the controls it receives, and of signalling anomalies, kernel self monitoring would not be necessary. But in the presence of "dumb" actuators, kernel self monitoring assumes the role of safety net, provided it meets the conditions previously described.

The actual monitoring function to apply to each actuator control depends heavily on said actuator, but such functions can be classified as (see the sketch below):
- combinatory, or memoryless: the output at any given time is determined solely by the inputs at that very same time;
- with memory: the output at any given moment is also affected by previous values, and vetting it becomes both more difficult and more expensive.
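
A minimal C sketch of both kinds of vetting function follows; the limits, names and state layout (`vet_combinatory`, `vet_with_memory`) are illustrative assumptions, not an established interface:

```c
#include <stdbool.h>
#include <stdint.h>

/* Memoryless check: a command is valid iff it lies in a safe range. */
static bool vet_combinatory(int32_t cmd, int32_t min_safe, int32_t max_safe)
{
	return cmd >= min_safe && cmd <= max_safe;
}

/* Vetting with memory requires retaining the previously accepted command. */
struct vet_state {
	int32_t last_cmd;
	bool primed; /* false until the first command is accepted */
};

/* Check with memory: additionally bound the rate of change between
 * consecutive commands, rejecting implausible jumps. */
static bool vet_with_memory(struct vet_state *s, int32_t cmd,
			    int32_t min_safe, int32_t max_safe,
			    int32_t max_step)
{
	if (!vet_combinatory(cmd, min_safe, max_safe))
		return false;
	if (s->primed) {
		int32_t step = cmd - s->last_cmd;

		if (step > max_step || step < -max_step)
			return false; /* implausible jump: reject */
	}
	s->last_cmd = cmd;
	s->primed = true;
	return true;
}
```

As the sketch suggests, stateful vetting is costlier: the monitor must now protect its own state (e.g. with the redundancy techniques from the previous section) on top of checking the output itself.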