Commit b2bd9c9

Update Using_Linux_in_a_Safe_System.md

Some clarifications and a new section about common mistakes

Signed-off-by: Igor Stoppa <[email protected]>

igor-stoppa authored Apr 10, 2024
1 parent 180d8b7 commit b2bd9c9

Showing 1 changed file with 139 additions and 45 deletions:
Contributions/Using_Linux_in_a_Safe_System.md

## **Purpose of the document**
The considerations provided here are meant to help with designing a safe
system incorporating Linux as an element.

Here "Linux" refers to what is sometimes also called "the Linux kernel".
This doesn't preclude also using what can typically be found in a GNU/Linux
distribution, but the choice of user-space components can differ across
distributions.

One could even have a custom system comprising the Linux kernel and just one
statically linked, self-contained binary program, and that would still count
as use of Linux.

Why use Linux for safety applications? There are many reasons why one might
want to use Linux in a system: cost savings (after all, it's free),
flexibility, support for many different hardware components, and so on.
But its safety is certainly not one of them - nor should it be.

Linux was neither conceived, nor has it evolved, to be safe. To be fair, it has
never made any claim to be safe, therefore it should not be expected to be safe.
Safety is not something that happens by accident.
Certainly, Linux is developed to very high quality standards, and it would be
foolish and ungenerous to claim otherwise.

But Quality Management is not a substitute for safety.

These considerations are intended to help with designing a system that is safe
**despite** using Linux.

All of the background justification, related to hardware (ARM64), Linux kernel
architecture, and allocation of safety requirements, can be found in the documents
- The document starts with a section containing design considerations.
- The following section contains several examples of how to apply such
considerations to the design of the Linux-bearing system.
- The successive section, instead, contains considerations about the risks
  of not fully understanding what certain system features can and cannot
  provide.


## **Considerations**

**Premise:** some of the following considerations assume a fairly high level
of freedom in designing the overall system.
While one must acknowledge that in practice such a high level of freedom will
not always be available, due to design constraints, it is equally true that
attempting to cope with constraints can lead to an even higher total cost of
ownership.


* **Clearly define requirements toward system integrity and availability, upfront**

Expand All @@ -93,7 +115,7 @@ cope with these constraints can lead to an even higher total cost of ownership.
For example, a requirement to merely detect a failure, and react within an
allotted time, is usually significantly simpler than having a stricter requirement
to guarantee that a certain functionality will not be interrupted.

Therefore, the intended use-cases play a major role in curbing potential
system designs.

* **Linux is not intrinsically safe**

* **Aggressively placing safety requirements on anything but (components of) the Linux kernel**

The natural consequence of the previous point is that - given a specific set of
safety goals for the system as a whole - one should explore if components other
than Linux could be the recipients of the ensuing safety requirements
(more on this later).

There are several additional reasons for favouring an external component, besides
what has been mentioned already:
* other non-technical reasons (e.g. different licensing)

A specific application or use case will have a broader set of requirements,
like Total Cost of Ownership, Mean Time Between Failures, weight, repairability,
just to mention a few, and this can and will affect the decision of what is
the optimal compromise in allocating specific safety requirements to either
Linux or other components.

Placing safety requirements on a software component means that the hardware
executing said software will also be involved in supporting the associated
safety claim.
Safety-qualified hardware tends - unsurprisingly - to be more expensive, so
it might be more cost effective to place safety requirements on a smaller,
cheaper component, than on the processor executing Linux.

It really depends on the application.

## **Examples of Safety-oriented System Design Concepts**

Interference affecting a component with safety requirements must be both detected
and reported, in a way that is compatible with the associated time requirements.
At this point, appropriate action(s) will be taken, in accordance with the
pre-established policy, to ensure adequate availability
(e.g. powering off, restarting, onlining of backup system, etc.)

As mentioned earlier, guaranteeing availability can be much more difficult
and/or costly.

* **Application Monitoring**

Monitoring is a simple approach to detection of interference that affects
components with safety requirements.

It can spare the need for extensive hardening.

However, these conditions must be met:
1. It must be possible to express a check, in a way that can
systematically, reliably and unequivocally produce a YES/NO verdict
about the health parameter of interest.
2. The check, while rigorous, must be sufficiently lightweight to be
compatible with the timing requirements, so that it can ensure the
reaction will happen within the time constraint expressed by the
system at hand.
3. The observation of whether the condition is met (or not) must be performed
   from a context enjoying the same or better safety qualification, provided
   also that
it is immune from the interference it is meant to detect.
And, of course, the context must be able to enact/trigger whatever action might be
dictated by the specific policy in place.

Assuming that these conditions are met (e.g. the system contains either a
watchdog or an equivalent component -- more on this later) then the opportunistic
approach is to identify a high-level activity that can be tied to the monitor.

A very crude example of this validation could be to require that an application
responsible for performing safety-relevant operations must ping a watchdog periodically.

As long as the application pings the watchdog within the set deadline, the watchdog
does not trigger any exceptional operation.
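
For illustration purposes only, here is a minimal user-space sketch of such a
pinging loop, assuming the watchdog is exposed through the standard Linux
watchdog character device (`/dev/watchdog`); `do_safety_relevant_work()` is a
hypothetical placeholder:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Hypothetical placeholder for one cycle of the application's
 * safety-relevant work. */
static void do_safety_relevant_work(void)
{
	usleep(500 * 1000);
}

int main(void)
{
	int timeout = 2; /* seconds: must fit the system's reaction-time budget */
	int fd = open("/dev/watchdog", O_WRONLY);

	if (fd < 0)
		return EXIT_FAILURE;

	/* Negotiate the expiry with the driver, then keep pinging. */
	ioctl(fd, WDIOC_SETTIMEOUT, &timeout);

	for (;;) {
		do_safety_relevant_work();
		ioctl(fd, WDIOC_KEEPALIVE, 0); /* must happen within 'timeout' */
	}
}
```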

The example is indeed crude, but it illustrates how - for example - all the internal Linux
components involved with booting, loading and executing the application might arguably
not need to be subject to any safety requirement, provided that the final product of
their operation (running the application with safety requirements) can be
monitored to ensure that it satisfies the required criteria.

One must understand and accept, though, that safety claims are possible only for
what is being monitored, either directly or indirectly.
In the example, the watchdog can only directly detect that there has been some
form of temporal interference. The cause might have been anything, as
long as it resolved into a noticeable temporal interference, e.g. through
cascading.

Priming/pinging the watchdog becomes the mechanism for meeting (some) safety
requirements. Involving components that could otherwise be avoided is
likely to result in additional unnecessary, expensive, tedious, error-prone
work.

It is usually far more effective to implement multi-stage monitoring
mechanisms, to ensure that the core component remains simple and easy to qualify,
while additional complexity of the validation can be achieved by chaining multiple
monitoring processes, terminating at the bottom with a very simple one.
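
As a minimal sketch of the bottom stage of such a chain (the shared-memory
layout and helper names are hypothetical), the only logic requiring thorough
qualification could be reduced to a freshness check of the heartbeats
published by the upper-stage monitors:

```c
/* Bottom stage of a multi-stage monitor (names hypothetical): it checks
 * heartbeat timestamps published by the upper-stage monitors, and forwards
 * a single ping to the hardware watchdog only when all of them are fresh. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define N_UPPER_MONITORS 3
#define MAX_AGE_NS 500000000LL /* 500 ms freshness budget per monitor */

/* Assumed to live in shared memory, written by the upper-stage monitors. */
extern volatile int64_t heartbeat_ns[N_UPPER_MONITORS];

extern void ping_hardware_watchdog(void); /* hypothetical HW access */

static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

void bottom_stage_cycle(void)
{
	bool all_fresh = true;

	for (int i = 0; i < N_UPPER_MONITORS; i++)
		if (now_ns() - heartbeat_ns[i] > MAX_AGE_NS)
			all_fresh = false;

	if (all_fresh)
		ping_hardware_watchdog();
	/* otherwise: withhold the ping and let the watchdog expire */
}
```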

It is equally critical that the aspect being monitored is verified to be
adequately correlated to the safety goal.
E.g. monitoring the periodic execution of a given process does not guarantee
that its memory has not been corrupted, nor that the process will not
result in unsafe actions.
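
One way to improve that correlation - sketched below with hypothetical names -
is to gate the watchdog ping on an integrity check of the safety-relevant
state, instead of on mere liveness:

```c
/* Sketch (hypothetical names): gate the ping not only on liveness but also
 * on an integrity check of safety-relevant state, so that the monitored
 * aspect is better correlated with the safety goal. */
#include <stddef.h>
#include <stdint.h>

struct safety_state {
	double set_point;
	double limit;
	uint32_t crc; /* CRC over the fields above, refreshed on every update */
};

extern uint32_t crc32_compute(const void *buf, size_t len); /* assumed helper */
extern void ping_watchdog(void); /* assumed helper */

void monitored_cycle(struct safety_state *s)
{
	uint32_t crc = crc32_compute(s, offsetof(struct safety_state, crc));

	/* Ping only while the state passes its integrity check: corrupted
	 * state withholds the ping and lets the watchdog take over. */
	if (crc == s->crc)
		ping_watchdog();
}
```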

Similarly, should the system have the goal of initialising a given hardware
component from a kernel driver, it might be necessary to introduce a
requirement for the device driver(s) involved to ping the watchdog during boot.

Safety requirements really do stem from safety goals, and this cannot be overstated.

* **Watchdogs / Safety Islands / Other Exception Contexts**

Continuing from the previous point, there are three typical design approaches:
watchdogs, safety islands and other exception contexts.
Upon expiry, the watchdog will typically enforce one of the
following policies (a configuration sketch follows this list):
* NMI: the watchdog generates a Non-Maskable Interrupt, which forces a core
to execute a predefined interrupt handler. The handler runs with its own
sanitised context, and can be assumed to be in a sane/safe/secure state.
However, it also retains access to the system state.

One could rely on the NMI interrupt service routine to take over and
initiate a reset, or power down, or whatever is required.
* reset: it usually consists of the electrical state of the system being
taken to its default values, paired with software re-initialisation.
Some systems might allow for partial reset, e.g. the outputs are "held",
while the inner logic is re-initialised.
* power down: power is cut off, at least from the components associated
to the (unmet) safety requirements.
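
As a hedged configuration sketch: on Linux, a notification ahead of the
watchdog's final action can sometimes be obtained through the pretimeout
mechanism, assuming that the watchdog driver supports it:

```c
/* Hedged sketch: configure a pretimeout so that a notification (delivered on
 * some platforms as an NMI, a panic, or a governor action) fires ahead of
 * the watchdog's final action. Assumes WDIOC_SETPRETIMEOUT support. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int configure_watchdog(const char *dev, int timeout_s, int pretimeout_s)
{
	int fd = open(dev, O_WRONLY);

	if (fd < 0)
		return -1;

	if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout_s) ||
	    ioctl(fd, WDIOC_SETPRETIMEOUT, &pretimeout_s)) {
		close(fd); /* NB: depending on the driver, this may disarm it */
		return -1;
	}

	return fd; /* keep the fd open: it is the pinging handle */
}
```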

This is fairly boilerplate information, but it is worth mentioning that
safety requirements apply not only to software but to hardware as well.
The watchdog IP must itself be safety-qualified, to be relied upon.
And so must the actuator it is supposed to control, e.g. the switch
cutting the power.

For the watchdog to support safety claims, the following must be proven:
* both design and implementation of the watchdog are compliant with the
system safety requirements
* the watchdog must be adequately isolated from the rest of the system
at physical level, so that the chance that physical anomalies
(e.g. temperature, voltage, current) will reach it is in line with
the safety requirements.
* the initialisation of the watchdog can be considered sufficiently safe
(same or better safety level than the required one)
* it must be impossible for a single logical failure in the system to
make it ineffective; this case covers:
* somehow a failure either disables or relaxes either the time
constraint of the watchdog or its expected action-upon-expiring
* somehow a failure triggers one or more pings to the watchdog,
when instead it should expire (see the sketch after this list)
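
As an illustration of the last point, some watchdog designs require a
challenge-response sequence instead of a plain kick, so that a single stray
write cannot masquerade as a valid ping; the register layout and the
transform below are purely hypothetical:

```c
/* Hypothetical sketch of a challenge-response keepalive: a stray write or a
 * stuck loop is unlikely to produce the correct, challenge-dependent answer,
 * so a single logical failure cannot silently keep the watchdog primed. */
#include <stdint.h>

struct wdt_regs {
	volatile uint32_t challenge; /* new pseudo-random value each window */
	volatile uint32_t response;  /* must receive f(challenge) in time */
};

static uint32_t wdt_answer(uint32_t challenge)
{
	/* Illustrative transform; real devices define their own. */
	return ~challenge ^ 0xA5A5A5A5u;
}

void wdt_keepalive(struct wdt_regs *wdt)
{
	wdt->response = wdt_answer(wdt->challenge);
}
```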

One option is to have the watchdog initialised by a component
typically involved in booting Linux, the bootloader.

A bootloader is usually a fairly small software component that is
invoked directly by the bootrom (other cases are possible and
discussed below). The bootloader, provided that it is at least as safe
as the overall required safety level, could initialise the watchdog,
and then pass the control to the Linux boot, confident that the watchdog
is already armed and supervising the boot process.
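
A bare-metal sketch of this arrangement (the MMIO layout and the calling
convention are hypothetical and deliberately simplified):

```c
#include <stdint.h>

/* Hypothetical watchdog MMIO registers of the target board. */
#define WDT_BASE    0x40001000u
#define WDT_LOAD    (*(volatile uint32_t *)(WDT_BASE + 0x0))
#define WDT_CTRL    (*(volatile uint32_t *)(WDT_BASE + 0x4))
#define WDT_CTRL_EN 0x1u

typedef void (*kernel_entry_t)(unsigned long dtb);

void boot_linux(kernel_entry_t kernel, unsigned long dtb)
{
	/* Arm the watchdog first: from here on, a stalled or corrupted
	 * boot will be caught by its expiry. */
	WDT_LOAD = 5000; /* hypothetical: expiry in milliseconds */
	WDT_CTRL = WDT_CTRL_EN;

	/* Hand over to Linux: the kernel (through one of its drivers, or a
	 * monitored application) must take over the pinging duty before
	 * the first expiry. */
	kernel(dtb);
}
```
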
"Safety Island" is a fancy name for referring to what is typically a
special core, within a System on a Chip, that can be used as foundation
for building safety claims. Characteristics are:
* Its hardware design, implementation and validation are compatible with
safety requirements. This means that it is intentionally isolated from
other components in the same System on a Chip, to fulfil the
independence requirements needed when decomposing safety requirements
or when protecting safety workloads from the interference of other
software/hardware elements.
* It is usually a microcontroller core
* Its operating parameters cannot be affected by other components in
the system (e.g. the cores running Linux).
For example, the safety island could implement a virtualisation
of the SoC watchdog toward multiple operating systems, however it
could go beyond that, and reach higher levels of complexity.

## **Hardware features are not a replacement for good design**

***Against stupidity the gods themselves contend in vain.***

Throwing expensive hardware features at a problem will not necessarily
be sufficient, or even useful.
It is always advisable to understand very well what a certain feature
might - or might not - do.
Some examples of misplaced faith in features that have a different
purpose from what one might think:

* ARM64 (E)PAN/PXN: these features prevent the kernel from abusing the
page table of a process to access memory pages with the same mappings
that the process would see. However, they do precisely nothing to
prevent the kernel from writing to those very same pages through the
linear map. These are security features meant to complicate the life
of an attacker, who would probably need to use the process page table
to make any sense of the pages' content. But a "successful" interference
only needs to alter the memory content. These features won't help.

* ECC memory: it helps detect or even correct certain classes of
hardware errors. Furthermore, it can even hamper the execution of
security attacks, like rowhammer. However, it will not do anything
to prevent or even just detect a spatial interference coming from a
logical component (e.g. a device driver accidentally altering the memory
of a process through the kernel linear map).

* Watchdog / Safety Island: they do help in monitoring that some
semi-periodic events happen as defined; however, it is also
important to have a correct pinging strategy.
For example, in a multi-threaded application, any thread running with
lower priority than the watchdog-pinging thread can starve undetected.
Should any of these threads be relevant for safety, its starvation will
go unnoticed by the pinging strategy.
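
A sketch of a pinging strategy that avoids this pitfall (helper names are
hypothetical): the pinging thread forwards a keepalive only when every
safety-relevant thread, regardless of its priority, has recently made
progress:

```c
/* Sketch (hypothetical names): the pinging thread kicks the watchdog only
 * when every safety-relevant thread has bumped its own progress counter
 * (e.g. via atomic_fetch_add) since the last check, so starvation of a
 * low-priority thread withholds the ping instead of going unnoticed. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define N_THREADS 4

static _Atomic uint32_t progress[N_THREADS]; /* bumped by each thread */
static uint32_t last_seen[N_THREADS];

extern void ping_watchdog(void); /* assumed helper */

void pinger_cycle(void)
{
	bool all_progressed = true;

	for (int i = 0; i < N_THREADS; i++) {
		uint32_t p = atomic_load(&progress[i]);

		if (p == last_seen[i])
			all_progressed = false; /* thread i may be starving */
		last_seen[i] = p;
	}

	if (all_progressed)
		ping_watchdog();
}
```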

* Real-time features: enabling full kernel preemption and real-time
capabilities is not a magic wand that substitutes for understanding system
constraints and requirements. To use them correctly, one must appreciate
that what they really provide is an upper bound on the response time
to certain events, assuming that priorities are not assigned in a way
that can cause starvation or priority inversion.
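
For instance (a sketch of standard POSIX mechanisms, not a complete
real-time design), priorities and locks can be configured deliberately to
address exactly these caveats:

```c
/* Sketch: deliberate real-time configuration with POSIX APIs. A real design
 * must still budget priorities across the whole system. */
#include <pthread.h>
#include <sched.h>

int setup_rt(pthread_mutex_t *lock)
{
	struct sched_param sp = { .sched_priority = 50 };
	pthread_mutexattr_t ma;

	/* Fixed-priority scheduling gives a bounded response time only if
	 * the overall priority assignment cannot starve safety-relevant
	 * threads. Requires adequate privileges. */
	if (sched_setscheduler(0, SCHED_FIFO, &sp))
		return -1;

	/* Priority inheritance on shared locks counters priority inversion. */
	pthread_mutexattr_init(&ma);
	pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
	return pthread_mutex_init(lock, &ma);
}
```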



## **License: CC BY-SA 4.0**

### **DEED**
