Commit b2bd9c9

Update Using_Linux_in_a_Safe_System.md

Some clarifications and a new section about common mistakes

Signed-off-by: Igor Stoppa <[email protected]>

igor-stoppa authored Apr 10, 2024
1 parent 180d8b7 commit b2bd9c9

Showing 1 changed file with 139 additions and 45 deletions:
Contributions/Using_Linux_in_a_Safe_System.md

## **Purpose of the document**
The considerations provided here are meant to help with designing a safe
system incorporating Linux as an element.

Here "Linux" refers to what is sometimes also called "the Linux kernel".
This doesn't preclude also using what can typically be found in a GNU/Linux
distribution, but the choice of user-space components can differ across
distributions.

One could even have a custom system comprising the Linux kernel and just one
statically linked, self-contained binary program, and that would still count
as use of Linux.

Why use Linux for safety applications? There are many reasons why one might
want to use Linux in a system: cost savings (after all, it's free),
flexibility, support for many different hardware components, and so on.
But its safety is certainly not one of them - nor should it be.

Linux was neither conceived, nor has it evolved, to be safe. To be fair, it has
never made any claim to be safe, therefore it should not be expected to be safe.
Safety is not something that happens by accident.
Certainly, Linux is developed to very high quality standards, and it would be
foolish and ungenerous to claim otherwise.

But Quality Management is not a substitute for safety.

These considerations are intended to help with designing a system that is safe
**despite** using Linux.

All of the background justification, related to hardware (ARM64), Linux kernel
architecture, and allocation of safety requirements, can be found in the documents
- The document starts with a section containing design considerations.
- The following section contains several examples of how to apply such
considerations to the design of the Linux-bearing system.
- The successive section, instead, contains considerations about the risks
  of not fully understanding what certain system features can and cannot
  provide.


## **Considerations**

**Premise:** some of the following considerations assume a fairly high level
of freedom in designing the overall system.
While one must acknowledge that in practice such a high level of freedom will
not always be available, due to design constraints, it is equally true that
attempting to cope with constraints can lead to an even higher total cost of
ownership.


* **Clearly define requirements toward system integrity and availability, upfront**

Expand All @@ -93,7 +115,7 @@ cope with these constraints can lead to an even higher total cost of ownership.
For example, a requirement to merely detect a failure, and react within an
allotted time, is usually significantly simpler than having a stricter requirement
to guarantee that a certain functionality will not be interrupted.

Therefore, the intended use-cases play a major role in curbing potential
system designs.

* **Linux is not intrinsically safe**

* **Aggressively placing safety requirements on anything but (components of) the Linux kernel**

The natural consequence of the previous point is that - given a specific set of
safety goals for the system as a whole - one should explore if components other
than Linux could be the recipients of the ensuing safety requirements
(more on this later).

There are several additional reasons for favouring an external component, besides
what has been mentioned already:
* other non-technical reasons (e.g. different licensing)

A specific application or use case will have a broader set of requirements,
like Total Cost of Ownership, Mean Time Between Failures, weight, repairability,
just to mention a few, and this can and will affect the decision of what is
the optimal compromise in allocating specific safety requirements to either
Linux or other components.

Placing safety requirements on a software component means that the hardware
executing said software will also be involved in supporting the associated
safety claim.
Safety-qualified hardware tends - unsurprisingly - to be more expensive, so
it might be more cost effective to place safety requirements on a smaller,
cheaper component, than on the processor executing Linux.

It really depends on the application.

## **Examples of Safety-oriented System Design Concepts**

Interference affecting a component with safety requirements must be both detected
and reported, in a way that is compatible with the associated time requirements.
At this point, appropriate action(s) will be taken, in accordance with the
pre-established policy, to ensure adequate availability
(e.g. powering off, restarting, onlining of backup system, etc.)

As mentioned earlier, guaranteeing availability can be much more difficult
and/or costly.

* **Application Monitoring**

Monitoring is a simple approach to detection of interference that affects
components with safety requirements.

It can spare the need for extensive hardening.

However, these conditions must be met:
1. It must be possible to express a check, in a way that can
systematically, reliably and unequivocally produce a YES/NO verdict
about the health parameter of interest.
2. The check, while rigorous, must be sufficiently lightweight to be
compatible with the timing requirements, so that it can ensure the
reaction will happen within the time constraint expressed by the
system at hand.
3. The observation of whether the condition is met (or not) must be performed
   from a context enjoying the same or better safety qualification, provided
   also that
it is immune from the interference it is meant to detect.
And, of course, the context must be able to enact/trigger whatever action might be
dictated by the specific policy in place.

Assuming that these conditions are met (e.g. the system contains either a
watchdog or an equivalent component -- more on this later) then the opportunistic
approach is to identify a high-level activity that can be tied to the monitor.

A very crude example of this validation could be to require that an application
responsible for performing safety-relevant operations must ping a watchdog periodically.

As long as the application pings the watchdog within the set deadline, the watchdog
does not trigger any exceptional operation.
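
For illustration purposes only, here is a minimal user-space sketch of such a
pinging loop, assuming the watchdog is exposed through the standard Linux
watchdog character device (`/dev/watchdog`); `do_safety_relevant_work()` is a
hypothetical placeholder:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Hypothetical placeholder for one cycle of the application's
 * safety-relevant work. */
static void do_safety_relevant_work(void)
{
	usleep(500 * 1000);
}

int main(void)
{
	int timeout = 2; /* seconds: must fit the system's reaction-time budget */
	int fd = open("/dev/watchdog", O_WRONLY);

	if (fd < 0)
		return EXIT_FAILURE;

	/* Negotiate the expiry with the driver, then keep pinging. */
	ioctl(fd, WDIOC_SETTIMEOUT, &timeout);

	for (;;) {
		do_safety_relevant_work();
		ioctl(fd, WDIOC_KEEPALIVE, 0); /* must happen within 'timeout' */
	}
}
```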

The example is indeed crude, but it illustrates how - for example - all the internal Linux
components involved with booting, loading and executing the application might arguably
not need to be subject to any safety requirement, provided that the final product of
their operation (running the application with safety requirements) can be
monitored to ensure that it satisfies the required criteria.

One must understand and accept, though, that safety claims are possible only for
what is being monitored, either directly or indirectly.
In the example, the watchdog can only directly detect that there has been some
form of temporal interference. The cause might have been anything, as
long as it resolved into a noticeable temporal interference, e.g. through
cascading.

Priming/pinging the watchdog becomes the mechanism for meeting (some) safety
requirements. Involving components that could otherwise be avoided is
likely to result in additional unnecessary, expensive, tedious, error-prone
work.

It is usually far more effective to implement multi-stage monitoring
mechanisms, to ensure that the core component remains simple and easy to qualify,
while additional complexity of the validation can be achieved by chaining multiple
monitoring processes, terminating at the bottom with a very simple one.
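
As a minimal sketch of the bottom stage of such a chain (the shared-memory
layout and helper names are hypothetical), the only logic requiring thorough
qualification could be reduced to a freshness check of the heartbeats
published by the upper-stage monitors:

```c
/* Bottom stage of a multi-stage monitor (names hypothetical): it checks
 * heartbeat timestamps published by the upper-stage monitors, and forwards
 * a single ping to the hardware watchdog only when all of them are fresh. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define N_UPPER_MONITORS 3
#define MAX_AGE_NS 500000000LL /* 500 ms freshness budget per monitor */

/* Assumed to live in shared memory, written by the upper-stage monitors. */
extern volatile int64_t heartbeat_ns[N_UPPER_MONITORS];

extern void ping_hardware_watchdog(void); /* hypothetical HW access */

static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

void bottom_stage_cycle(void)
{
	bool all_fresh = true;

	for (int i = 0; i < N_UPPER_MONITORS; i++)
		if (now_ns() - heartbeat_ns[i] > MAX_AGE_NS)
			all_fresh = false;

	if (all_fresh)
		ping_hardware_watchdog();
	/* otherwise: withhold the ping and let the watchdog expire */
}
```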

It is equally critical that the aspect being monitored is verified to be
adequately correlated to the safety goal.
E.g. monitoring the periodic execution of a given process does not guarantee
that its memory has not been corrupted, nor that the process will not
result in unsafe actions.
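
One way to improve that correlation - sketched below with hypothetical names -
is to gate the watchdog ping on an integrity check of the safety-relevant
state, instead of on mere liveness:

```c
/* Sketch (hypothetical names): gate the ping not only on liveness but also
 * on an integrity check of safety-relevant state, so that the monitored
 * aspect is better correlated with the safety goal. */
#include <stddef.h>
#include <stdint.h>

struct safety_state {
	double set_point;
	double limit;
	uint32_t crc; /* CRC over the fields above, refreshed on every update */
};

extern uint32_t crc32_compute(const void *buf, size_t len); /* assumed helper */
extern void ping_watchdog(void); /* assumed helper */

void monitored_cycle(struct safety_state *s)
{
	uint32_t crc = crc32_compute(s, offsetof(struct safety_state, crc));

	/* Ping only while the state passes its integrity check: corrupted
	 * state withholds the ping and lets the watchdog take over. */
	if (crc == s->crc)
		ping_watchdog();
}
```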

Similarly, should the system have the goal of initialising a given hardware
component from a kernel driver, it might be necessary to introduce a
requirement for the device driver(s) involved to ping the watchdog during boot.

Safety requirements really do stem from safety goals, and this cannot be overstated.

* **Watchdogs / Safety Islands / Other Exception Contexts**

Continuing from the previous point, there are three typical design approaches:
watchdogs, safety islands and other exception contexts.
Upon expiry, the watchdog will typically enforce one of the
following policies (a configuration sketch follows this list):
* NMI: the watchdog generates a Non-Maskable Interrupt, which forces a core
to execute a predefined interrupt handler. The handler runs with its own
sanitised context, and can be assumed to be in a sane/safe/secure state.
However, it also retains access to the system state.

One could rely on the NMI interrupt service routine to take over and
initiate a reset, or power down, or whatever is required.
* reset: it usually consists of the electrical state of the system being
taken to its default values, paired with software re-initialisation.
Some systems might allow for partial reset, e.g. the outputs are "held",
while the inner logic is re-initialised.
* power down: power is cut off, at least from the components associated
to the (unmet) safety requirements.
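
As a hedged configuration sketch: on Linux, a notification ahead of the
watchdog's final action can sometimes be obtained through the pretimeout
mechanism, assuming that the watchdog driver supports it:

```c
/* Hedged sketch: configure a pretimeout so that a notification (delivered on
 * some platforms as an NMI, a panic, or a governor action) fires ahead of
 * the watchdog's final action. Assumes WDIOC_SETPRETIMEOUT support. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int configure_watchdog(const char *dev, int timeout_s, int pretimeout_s)
{
	int fd = open(dev, O_WRONLY);

	if (fd < 0)
		return -1;

	if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout_s) ||
	    ioctl(fd, WDIOC_SETPRETIMEOUT, &pretimeout_s)) {
		close(fd); /* NB: depending on the driver, this may disarm it */
		return -1;
	}

	return fd; /* keep the fd open: it is the pinging handle */
}
```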

This is fairly boilerplate information, but it is worth mentioning that
safety requirements apply not only to software but to hardware as well.
The watchdog IP must itself be safety-qualified, to be relied upon.
And so must the actuator it is supposed to control, e.g. the switch
cutting the power.

For the watchdog to support safety claims, the following must be proven:
* both design and implementation of the watchdog are compliant with the
system safety requirements
* the watchdog must be adequately isolated from the rest of the system
at physical level, so that the chance that physical anomalies
(e.g. temperature, voltage, current) will reach it is in line with
the safety requirements.
* the initialisation of the watchdog can be considered sufficiently safe
(same or better safety level than the required one)
* it must be impossible for a single logical failure in the system to
make it ineffective; this case covers:
* somehow a failure either disables or relaxes either the time
constraint of the watchdog or its expected action-upon-expiring
* somehow a failure triggers one or more pings to the watchdog,
when instead it should expire (see the sketch after this list)
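
As an illustration of the last point, some watchdog designs require a
challenge-response sequence instead of a plain kick, so that a single stray
write cannot masquerade as a valid ping; the register layout and the
transform below are purely hypothetical:

```c
/* Hypothetical sketch of a challenge-response keepalive: a stray write or a
 * stuck loop is unlikely to produce the correct, challenge-dependent answer,
 * so a single logical failure cannot silently keep the watchdog primed. */
#include <stdint.h>

struct wdt_regs {
	volatile uint32_t challenge; /* new pseudo-random value each window */
	volatile uint32_t response;  /* must receive f(challenge) in time */
};

static uint32_t wdt_answer(uint32_t challenge)
{
	/* Illustrative transform; real devices define their own. */
	return ~challenge ^ 0xA5A5A5A5u;
}

void wdt_keepalive(struct wdt_regs *wdt)
{
	wdt->response = wdt_answer(wdt->challenge);
}
```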

One option is to have the watchdog initialised by a component
typically involved in booting Linux, the bootloader.

A bootloader is usually a fairly small software component that is
invoked directly by the bootrom (other cases are possible and
discussed below). The bootloader, provided that it is at least as safe
as the overall required safety level, could initialise the watchdog,
and then pass the control to the Linux boot, confident that the watchdog
is already armed and supervising the boot process.
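
A bare-metal sketch of this arrangement (the MMIO layout and the calling
convention are hypothetical and deliberately simplified):

```c
#include <stdint.h>

/* Hypothetical watchdog MMIO registers of the target board. */
#define WDT_BASE    0x40001000u
#define WDT_LOAD    (*(volatile uint32_t *)(WDT_BASE + 0x0))
#define WDT_CTRL    (*(volatile uint32_t *)(WDT_BASE + 0x4))
#define WDT_CTRL_EN 0x1u

typedef void (*kernel_entry_t)(unsigned long dtb);

void boot_linux(kernel_entry_t kernel, unsigned long dtb)
{
	/* Arm the watchdog first: from here on, a stalled or corrupted
	 * boot will be caught by its expiry. */
	WDT_LOAD = 5000; /* hypothetical: expiry in milliseconds */
	WDT_CTRL = WDT_CTRL_EN;

	/* Hand over to Linux: the kernel (through one of its drivers, or a
	 * monitored application) must take over the pinging duty before
	 * the first expiry. */
	kernel(dtb);
}
```
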
"Safety Island" is a fancy name for referring to what is typically a
special core, within a System on a Chip, that can be used as foundation
for building safety claims. Characteristics are:
* Its hardware design, implementation and validation are compatible with
safety requirements. This means that it is intentionally isolated from
other components in the same System on a Chip, to fulfil the
independence requirements needed when decomposing safety requirements
or when protecting safety workloads from the interference of other
software/hardware elements.
* It is usually a microcontroller core
* Its operating parameters cannot be affected by other components in
the system (e.g. the cores running Linux).
For example, the safety island could implement a virtualisation
of the SoC watchdog toward multiple operating systems, however it
could go beyond that, and reach higher levels of complexity.

## **Hardware features are not a replacement for good design**

***Against stupidity the gods themselves contend in vain.***

Throwing expensive hardware features at a problem will not necessarily
be sufficient, or even useful.
It is always advisable to understand very well what a certain feature
might - or might not - do.
Some examples of misplaced faith in features that have a different
purpose from what one might think:

* ARM64 (E)PAN/PXN: these features prevent the kernel from abusing the
page table of a process to access memory pages with the same mappings
that the process would see. However, they do precisely nothing to
prevent the kernel from writing to those very same pages through the
linear map. These are security features meant to complicate the life
of an attacker, who would probably need to use the process page table
to make any sense of the pages' content. But a "successful" interference
only needs to alter the memory content. These features won't help.

* ECC memory: it helps detect or even correct certain classes of
hardware errors. Furthermore, it can even hamper the execution of
security attacks, like rowhammer. However, it will not do anything
to prevent or even just detect a spatial interference coming from a
logical component (e.g. a device driver accidentally altering the memory
of a process through the kernel linear map).

* Watchdog / Safety Island: they do help in monitoring that some
semi-periodic events happen as defined; however, it is also
important to have a correct pinging strategy.
For example, in a multi-threaded application, any thread running with
lower priority than the watchdog-pinging thread can starve undetected.
Should any of these threads be relevant for safety, its starvation will
go unnoticed by the pinging strategy.
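
A sketch of a pinging strategy that avoids this pitfall (helper names are
hypothetical): the pinging thread forwards a keepalive only when every
safety-relevant thread, regardless of its priority, has recently made
progress:

```c
/* Sketch (hypothetical names): the pinging thread kicks the watchdog only
 * when every safety-relevant thread has bumped its own progress counter
 * (e.g. via atomic_fetch_add) since the last check, so starvation of a
 * low-priority thread withholds the ping instead of going unnoticed. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define N_THREADS 4

static _Atomic uint32_t progress[N_THREADS]; /* bumped by each thread */
static uint32_t last_seen[N_THREADS];

extern void ping_watchdog(void); /* assumed helper */

void pinger_cycle(void)
{
	bool all_progressed = true;

	for (int i = 0; i < N_THREADS; i++) {
		uint32_t p = atomic_load(&progress[i]);

		if (p == last_seen[i])
			all_progressed = false; /* thread i may be starving */
		last_seen[i] = p;
	}

	if (all_progressed)
		ping_watchdog();
}
```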

* Real-time features: enabling full kernel preemption and real-time
capabilities is not a magic wand that substitutes for understanding system
constraints and requirements. To use them correctly, one must appreciate
that what they really provide is an upper bound on the response time
to certain events, assuming that priorities are not assigned in a way
that can cause starvation or priority inversion.
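
For instance (a sketch of standard POSIX mechanisms, not a complete
real-time design), priorities and locks can be configured deliberately to
address exactly these caveats:

```c
/* Sketch: deliberate real-time configuration with POSIX APIs. A real design
 * must still budget priorities across the whole system. */
#include <pthread.h>
#include <sched.h>

int setup_rt(pthread_mutex_t *lock)
{
	struct sched_param sp = { .sched_priority = 50 };
	pthread_mutexattr_t ma;

	/* Fixed-priority scheduling gives a bounded response time only if
	 * the overall priority assignment cannot starve safety-relevant
	 * threads. Requires adequate privileges. */
	if (sched_setscheduler(0, SCHED_FIFO, &sp))
		return -1;

	/* Priority inheritance on shared locks counters priority inversion. */
	pthread_mutexattr_init(&ma);
	pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
	return pthread_mutex_init(lock, &ma);
}
```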



## **License: CC BY-SA 4.0**

### **DEED**
