
Commit

Update Linux_Memory_Management_Essentials.md
Signed-off-by: Igor Stoppa <[email protected]>
igor-stoppa authored May 23, 2024
1 parent 34fe8ee commit 1a5d331
Showing 1 changed file with 54 additions and 21 deletions: Contributions/Linux_Memory_Management_Essentials.md
@@ -51,7 +51,7 @@ Individual points are numbered for ease of reference, but the numbering is not m

### **Kernel-space memory allocations**

#### **Directly Verifiable Assertions**
The following section presents a set of statements that can be objectively verified, e.g. by inspecting the sources.

1. unlike process memory, kernel memory pages are neither swapped out nor silently dropped by the kernel itself,
@@ -62,32 +62,65 @@
(semi)contiguous (there can be holes) range. Memory within this range is both virtually and physically contiguous.
4. physically contiguous memory is treated as a scarce resource, and is typically not provided to userspace, unless it explicitly asks for it (e.g. for implementing some DMA buffer)
5. the kernel can access userspace memory in three ways:
    1. through the userspace mapping
        1. this type of access is limited to a few functions, like copy_to_user()/copy_from_user() and put_user()/get_user() (see the first sketch after this list)
        2. outside the execution of the functions mentioned, this type of access is not possible, because the userspace mappings are made available only while executing such functions
        3. it uses the userspace memory map, which can implement HW protections against kernel misuse
        4. it allows the kernel to see process pages in the same sequence and with the same mappings as the process does
    2. through a memory buffer (e.g. a memory area where the kernel regularly needs to perform a large amount of read/write operations, like a network buffer)
        1. the memory is well defined, rather than a generic area, and it is specifically reserved for this purpose
        2. the kernel uses its own mappings in EL1, while the user space uses the EL0 ones
        3. very often the kernel doesn't actually access this area directly; rather, it configures a DMA controller to do the transfers, directly to/from the physical memory.
        4. misconfiguration of these peripherals that can access physical memory directly is a potential problem - a form of mitigation relies on using IOMMUs: hardware components that the kernel can configure as a firewall on the physical memory bus, to limit what memory is accessible to each memory bus master device.
    3. through the EL1 mappings + linear map
        1. not the intended way for the kernel to access process context, because the intended way is one of the two previously described
        2. attempting to use EL1 mappings would not be very useful:
            1. the sequence of pages mapped in the userspace process is not known, and it can change continuously, as the memory management does its job
            2. some process pages might even be "missing", because they have been either swapped out or dropped
            3. even for an attacker, in a security scenario, because most likely the attacker would need to access user space pages sequentially, or anyway through user mappings (this is why EL1 mappings are allowed to have access to pages containing EL0 code/data)
            4. it bypasses any protection that the process might employ through its own mapping, which, even for legitimate kernel operations on userspace, would be less safe
6. Assertions about page "mobility", from different perspectives:
    1. employment of physical pages - observing a physical memory page (e.g. a 4kB chunk aligned on a 4kB boundary) and how it is employed: what sort of content it might host, over time.
        1. certain physical pages are put to use for a specific purpose, which is (almost) immutable, for the entire duration of the execution of the kernel:
            1. kernel code from the monolithic portion (except for pages containing kernel init code, which are released after kernel init is completed)
            2. kernel statically allocated data from the monolithic portion (except for pages containing kernel init data, which are released after kernel init is completed)
            3. some kernel dynamically allocated data, used by the kernel itself, and never released, due to the nature of its use (e.g. object caches, persistent buffers, etc.)
            4. memory used for loadable kernel modules (code, data) is tied to the permanence of the module in memory - this is typically stable through the entire duration of the execution. Some exceptions are modules loaded/unloaded as a consequence of certain peripherals (dis)appearing, e.g. USB ones.
        2. other physical pages (most, actually, in a typical system) are available for runtime allocation/release cycles of multiple users:
            1. transient kernel linear memory allocations (kmalloc / get_free_pages)
            2. transient kernel virtually linear memory allocations (vmalloc for kernel address space)
            3. user-space memory allocations (they are always virtually linear), which are by default transient and exposed to repurposing proactively done by the kernel (see below)
    2. transitioning of a logical page - given certain context and content, where it might be located in memory over time - and whether it might even be discarded.
        1. the kernel doesn't spontaneously repurpose the utilisation of its own physical pages; therefore it is possible to assume that the logical content of kernel allocations will remain tied to the associated physical pages, as long as it is not intentionally altered (or subject to interference)
            1. metadata memory used by the kernel for housekeeping purposes related to processes is included in this category; examples: task/cred structures, vmas, maple leaves, process page tables.
        2. memory used by the kernel for the actual userspace processes: the content of this logical page determines its life cycle and expectancy: certain content such as code or constants can be "re-constructed" by reloading it from file (code pages are likely to necessitate re-linking), so the actual logical content might disappear, over time. Other pages, on the other hand, are meant to hold non-re-constructible content, such as stack, heap, variable data. These pages can, at most, be swapped out, and loaded back later on, but they cannot be simply dropped.
        3. page cache: it is a collection of memory pages containing data read from files, over time; e.g. code or initialised data from a file that was accessed recently; in some cases the page might never have been used, but it was loaded as part of the readahead optimisation. The life expectancy of these logical pages is heavily affected by how many processes might keep accessing them and by the level of memory starvation of the system caused by other processes, with some additional complexity layered on top of this by the use of containers.
    3. The kernel utilises various optimisations that are meant to take advantage of hardware features, such as multi-stage caching, and also to cope with different memory architectures (like Non-Uniform Memory Access - NUMA). The main goals are:
        1. avoid having to propagate write operations too frequently down the cache stack, which is caused by pages being evicted from the cache, due to memory pressure
        2. avoid having multiple CPUs writing to the same page, in a NUMA system, where only one CPU has direct write access to that memory page
        Therefore, the kernel tends to:
        1. reuse as much as possible a certain page that has just been freed (a so-called hot page, since it is still presumably present in the HW cache)
        2. keep for each core a stash of readily available memory pages (which prevents other cores from accessing said pages and introducing additional cache-flush operations)
    4. The MMU, involved in performing address translations, also acts as a bus master, and performs read operations whenever it needs to do an address translation that is not already present in its own local cache (TLB - Translation Lookaside Buffer). Having to perform too many of these address translations (page walks) can constitute a significant performance penalty. The TLB is not very large, and accessing lots of different memory addresses that do not belong to the same translation entry can cause severe performance degradation. This is why the kernel actually keeps most of the memory mapped in the linear map, to take advantage of a feature present in many processors that allows the mapping of large chunks of physical memory (e.g. 2MB) as a single entry (or a few). The kernel code, for example, is kept compact to maximise the efficiency of the fetching operations.
    5. For what concerns allocation from the linear map (kmalloc / get_free_pages), the kernel attempts to keep the free pages as contiguous as possible, avoiding fragmentation (see the allocation sketch after this list).
        1. this is implemented through the concept of the "buddy allocator", meaning that whenever a certain amount of linear memory is requested (either sub-page or multi-page size), the allocator always tries to obtain it from the smallest free slot available, only breaking larger free slots when no alternative is possible.
        2. the kernel also keeps ready a certain amount of pre-diced memory allocations, to avoid incurring the penalty of having to look for some free memory as a consequence of an allocation request.
        3. folios are structures introduced to simplify the management of what has traditionally been called compound pages: a compound page represents a group of contiguous pages that is treated as a single logical unit. Folios could eventually support the use of optimisations provided by certain processors (e.g. ARM64 allows the use of a single page table entry to represent 16 pages, as long as they are physically contiguous and aligned to a 16-page boundary, through the "contiguous bit" flag in the page table). This can be useful e.g. when keeping in the page cache a chunk of data from a file: should the memory be released, it could result in releasing several physically contiguous pages, instead of scattered ones.
    6. whenever possible, allocations happen through caches, which means that said caches must be filled whenever they hit a low watermark, and this can happen in two ways:
        1. through recycling memory that happens to be freed: for example, in case a core is running short of pages in its own local queue, it might "capture" a page that it is freeing.
        2. through a dedicated thread that can asynchronously dice larger-order pages into smaller portions that are then placed into caches that need to be refilled
    7. The kernel can also employ an Out-Of-Memory (OOM) Killer feature, which is invoked in extreme cases, when all the existing stashes of memory have been depleted: in this case the killer will pick a userspace process and just evict it, releasing all the resources it had allocated. It's far from desirable, but it's a method sometimes employed.
    8. freeing of memory pages also happens in a deferred way, through separate threads, so that there is no overhead on the freer in updating the metadata associated to the memory that has just been released.
    9. All of the mechanisms described above for memory management are memory users as well, because they rely on metadata that cannot be pre-allocated and must be adjusted according to the memory transactions happening as the system evolves over time.
    10. The Linux kernel provides means to limit certain requests a process might present; for example, with cgroups it is possible to create memory "bubbles" that cannot grow beyond a set size, and associate processes to them, which then share the collective limit within the bubble (see the cgroup sketch after this list). But this does nothing toward separating how the kernel might use the underlying memory, besides setting the constraint as described.
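
The access pattern from point 5.1 can be illustrated with a short sketch. This is a hypothetical, minimal example (the function name and the buffer handling are illustrative, not taken from any driver); it only shows that user memory is reached exclusively through the dedicated accessors:

```c
/*
 * Minimal sketch of point 5.1: the kernel touches process memory only
 * through dedicated helpers, never by dereferencing a user pointer.
 * Hypothetical example, for illustration only.
 */
#include <linux/uaccess.h>
#include <linux/slab.h>
#include <linux/mm.h>

static long example_read_from_user(const void __user *ubuf, size_t len)
{
	char *kbuf;
	long ret = 0;

	if (len == 0 || len > PAGE_SIZE)
		return -EINVAL;

	kbuf = kmalloc(len, GFP_KERNEL);
	if (!kbuf)
		return -ENOMEM;

	/*
	 * copy_from_user() returns the number of bytes that could NOT be
	 * copied; the userspace mapping is reachable only inside helpers
	 * like this one, as described above.
	 */
	if (copy_from_user(kbuf, ubuf, len))
		ret = -EFAULT;

	/* ... operate on kbuf here ... */

	kfree(kbuf);
	return ret;
}
```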
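
Similarly, the allocation primitives mentioned in points 5 and 6 above (kmalloc/get_free_pages from the linear map, vmalloc for virtually linear memory) can be contrasted in a minimal sketch; the sizes and the order-2 request are arbitrary values chosen for illustration:

```c
/*
 * Sketch contrasting the allocators discussed above.
 * Illustrative only; error handling is shortened.
 */
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/gfp.h>

static int example_allocations(void)
{
	/* sub-page chunk from the linear map: virtually AND physically contiguous */
	void *small = kmalloc(256, GFP_KERNEL);

	/* order-2 request: 4 physically contiguous pages from the buddy allocator */
	unsigned long pages = __get_free_pages(GFP_KERNEL, 2);

	/* large buffer: virtually contiguous, but physically scattered pages */
	void *large = vmalloc(8 * 1024 * 1024);

	/* ... use the buffers ... */

	/* all three release paths tolerate a failed (NULL/0) allocation */
	kfree(small);
	free_pages(pages, 2);
	vfree(large);
	return 0;
}
```

The distinction matters because only the first two return physically contiguous memory, which is the scarce resource described in point 4.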
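
Finally, the memory "bubble" from point 10 can be sketched from userspace; the group name qm_bubble and the cgroup v2 mount point are assumptions made for illustration, not taken from the document:

```c
/*
 * Userspace sketch of a cgroup v2 memory "bubble" (point 10).
 * Assumes a v2 hierarchy mounted at /sys/fs/cgroup and an already
 * created group named "qm_bubble"; both are hypothetical.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	FILE *f;

	/* hard limit: the bubble cannot grow beyond 256 MiB */
	f = fopen("/sys/fs/cgroup/qm_bubble/memory.max", "w");
	if (!f)
		return 1;
	fprintf(f, "%llu\n", 256ULL << 20);
	fclose(f);

	/* move the current process into the bubble */
	f = fopen("/sys/fs/cgroup/qm_bubble/cgroup.procs", "w");
	if (!f)
		return 1;
	fprintf(f, "%d\n", (int)getpid());
	fclose(f);

	/*
	 * As noted above, this only caps userspace consumption; it does
	 * not segregate how the kernel recirculates the underlying pages.
	 */
	return 0;
}
```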

#### **Safety-Oriented Considerations**
The following section presents considerations that are of a more deductive nature.

1. Because of the way pages, and fractions and multiples of them, are allocated, freed, cached, and recovered, there is a complex interaction between system components at various layers.
2. Even using cgroups, it is not possible to segregate low-level interactions between components with different levels of safety qualification (e.g. a QM container can, and most likely will, affect the recirculation of pages related to an ASIL one).
3. Because of the nature of memory management, it must be expected that, either due to a bug or due to interference with its metadata, it can interfere with safe processes, e.g. hand to a requesting entity a memory page that is currently already in use by a device driver or userspace process playing a role in a safety use case.

### **User-space memory allocations**

