Update Linux_Memory_Management_Essentials.md
Signed-off-by: Igor Stoppa <[email protected]>
igor-stoppa authored May 24, 2024
1 parent 7f53bcc commit 14c14ea
Showing 1 changed file with 13 additions and 13 deletions: Contributions/Linux_Memory_Management_Essentials.md
Please refer to the Linux Kernel documentation.
* When referring to specific HW features, this document assumes the ARM64 architecture.

## **Purpose of the document**
This document aims to provide a holistic view of what happens in Linux memory management, so that
one is at least aware of certain features and can use this document as a jumping pad toward more detailed documentation,
or even the code base itself.

## **Structure of the document**
The following section presents a set of statements that can be objectively verified.
1. the memory is well defined, rather than a generic area, and it is specifically reserved for this purpose
2. the kernel uses its own mappings in EL1, while the user space uses the EL0 ones
3. very often the kernel doesn't actually access this area directly, but rather configures a DMA controller to do the transfers, directly to the physical memory.
4. misconfiguration of these peripherals that can access physical memory directly is a potential problem - a form of mitigation relies on using IOMMUs: hardware components that the kernel can configure as a firewall on the physical memory bus, to limit what memory is accessible to each memory bus master device.
3. through the EL1 mappings + linear map
1. not the intended way for the kernel to access process context, because the intended way is one of the two previously described
2. attempting to use EL1 mappings would not be very useful:
1. kernel code from the monolithic portion (except for pages containing kernel init code, which are released after kernel init is completed)
2. kernel statically allocated data from the monolithic portion (except for pages containing kernel init data, which are released after kernel init is completed)
3. some kernel dynamically allocated data, used by the kernel itself, and never released, due to the nature of its use (e.g. object caches, persistent buffers, etc.)
4. memory used for loadable kernel modules (code, data) is tied to the permanence of the module in memory - this is typically stable through the entire duration of the execution. Some exceptions are modules loaded/unloaded as a consequence of certain peripherals (dis)appearing, e.g. USB ones.
2. other physical pages (most, actually, in a typical system) are available for runtime allocation/release cycles of multiple users:
1. transient kernel linear memory allocations (kmalloc / get_free_pages)
2. transient kernel virtually linear memory allocations (vmalloc for kernel address space)
3. user-space memory allocations (they are always virtually linear), which are by default transient and exposed to repurposing proactively done by the kernel (see below)
2. transitioning of a logical page - given certain context and content, where it might be located in memory over time - and whether it might even be discarded.
1. the kernel doesn't spontaneously repurpose its own physical pages; therefore it is possible to assume that the logical content of kernel allocations will remain tied to the associated physical pages, as long as it is not intentionally altered (or subject to interference)
1. metadata memory used by the kernel for housekeeping purposes related to processes is included in this category; examples: task/cred structures, vmas, maple leaves, process page tables.
2. memory used by the kernel for the actual userspace processes: the content of this logical page determines its life cycle and expectancy: certain content such as code or constants can be "re-constructed" by reloading it from file (code pages are likely to necessitate undergoing re-linking), so the actual logical content might disappear, over time. Other pages, on the other hand, are meant to hold non-re-constructible content, such as stack, heap, and variable data. These pages can, at most, be swapped out, and loaded back later on, but they cannot be simply dropped.
3. page cache: it is a collection of memory pages containing data read from files, over time; e.g. code or initialised data from a file that was accessed recently; in some cases the page might never have been used, but it was loaded as part of the readahead optimisation. The life expectancy of these logical pages is heavily affected by how many processes might keep accessing them and the level of memory starvation of the system caused by other processes, with some additional complexity layered on top of this, by the use of containers.
3. The kernel utilises various optimisations that are meant to take advantage of hardware features, such as multi-stage caching, and also to cope with different memory architectures (like Non-Uniform Memory Access - NUMA). The main goals are:
1. avoid having to propagate write operations down the cache stack too frequently, which is caused by pages being evicted from the cache, due to memory pressure
2. avoid having multiple CPUs writing to the same page, in a NUMA system, where only one CPU has direct write access to that memory page
Therefore, the kernel tends to:
1. reuse as much as possible a certain page that has just been freed (a so-called hot page, since it is still presumably present in the HW cache)
2. keep for each core a stash of memory pages readily available (which prevents other cores from accessing said pages and introducing additional cache-flush operations)
4. The MMU, involved in performing address translations, acts also as a bus master, and performs read operations whenever it needs to do an address translation that is not already present in its own local cache (TLB - Translation Lookaside Buffer). Having to perform too many of such address translations (page walks) can constitute a significant performance penalty. The TLB is not very large, and accessing lots of different memory addresses that do not belong to the same translation entry can cause severe performance degradation. This is why the kernel actually keeps most of the memory mapped in the linear map, to take advantage of a feature present in many processors, that allows the mapping of large chunks of physical memory (e.g 2MB) as a single entry (or few ones). The kernel code, for example, is kept compact to maximise the efficiency of the fetching operations.
5. For what concerns allocation from the linear map (kmalloc / get_free_pages), the kernel attempts to keep the free pages as contiguous as possible, avoiding fragmentation.
1. this is implemented through the concept of the "buddy allocator", meaning that whenever a certain amount of linear memory is requested (either sub-page or multi-page size), it always tries to obtain it from the smallest free slot available, only breaking larger free slots when no alternatives are possible.
2. the kernel also keeps ready a certain amount of pre-diced memory allocations, to avoid incurring the penalty of having to look for some free memory as a consequence of an allocation request.
3. folios are structures introduced to simplify the management of what has been traditionally called compound pages: a compound page represents a group of contiguous pages that is treated as a single logical unit. Folios could eventually support the use of optimisations provided by certain architectures (e.g. ARM64 allows the use of a single page table entry to represent 16 pages, as long as they are physically contiguous and aligned to a 16-page boundary, through the "contiguous bit" flag in the page table). This can be useful e.g. when keeping in the page cache a chunk of data from a file: should the memory be released, it could result in releasing several physically contiguous pages, instead of scattered ones.
6. whenever possible, allocations happen through caches, which means that said caches must be filled whenever they hit a low watermark, and this can happen in two ways:
1. through recycling memory that happens to be freed: for example in case a core is running short of pages in its own local queue, it might "capture" a page that it is freeing.
2. through a dedicated thread that can asynchronously dice larger order pages into smaller portions that are then placed into caches that need to be refilled
7. the kernel can also employ an Out Of Memory Killer feature, that is invoked in extreme cases, when all the existing stashes of memory have been depleted: in this case the killer will pick a user space process and kill it, releasing all the resources it had allocated. It's far from desirable, but it's a method sometimes employed.
8. freeing of memory pages also happens in a deferred way, through separate threads, so that there is no overhead on the freer, in updating the metadata associated with the memory that has just been released.
9. All of the mechanisms described above for memory management are memory users as well, because they rely on metadata that cannot be pre-allocated and must be adjusted according to the memory transactions happening as the system evolves over time.
10. The Linux kernel provides means to limit certain requests a process might present; for example, with cgroups it is possible to create memory "bubbles" that cannot grow beyond a set size, and to associate processes to them, which then share the collective limit within the bubble. But this does nothing toward separating how the kernel might use the underlying memory, besides setting the constraint as described.

#### **Safety-Oriented consideration**
The following considerations are of a more deductive nature.

1. Because of the way pages, fractions and multiples of them are allocated, freed, cached, recovered, there is a complex interaction between system components at various layers.
2. Even using cgroups, it is not possible to segregate interaction at the low level between components with different levels of safety qualification (e.g. a QM container can and most likely will affect the recirculation of pages related to an ASIL one)


#### **Safety-Oriented consideration**
The following considerations are of a more deductive nature.

1. a process that is supposed to support safety requirements should not have pages swapped out / dropped,
because this would introduce:
