Add MI300 details to docs #446

Draft · wants to merge 22 commits into base: develop
10 changes: 5 additions & 5 deletions docs/conceptual/definitions.rst
@@ -102,7 +102,7 @@ generations, but may differ in exact implementation.
In addition, these memory types *might* differ between accelerators on the same
system, even when accessing the same memory allocation.

- For example, an :ref:`MI2XX <mixxx-note>` accelerator accessing *fine-grained*
+ For example, an :ref:`MI200 <mixxx-note>` accelerator accessing *fine-grained*
memory allocated local to that device may see the allocation as coherently
cacheable, while a remote accelerator might see the same allocation as
*uncached*.
@@ -117,15 +117,15 @@ These memory types include:

* - Uncached Memory (UC)
- Memory that will not be cached in this accelerator. On
- :ref:`MI2XX <mixxx-note>` accelerators, this corresponds “fine-grained”
+ :ref:`MI200 <mixxx-note>` accelerators, this corresponds to “fine-grained”
(or, “coherent”) memory allocated on a remote accelerator or the host,
for example, using ``hipHostMalloc`` or ``hipMallocManaged`` with default
allocation flags.

* - Non-hardware-Coherent Memory (NC)
- Memory that will be cached by the accelerator, and is only guaranteed to
be consistent at kernel boundaries / after software-driven
- synchronization events. On :ref:`MI2XX <mixxx-note>` accelerators, this
+ synchronization events. On :ref:`MI200 <mixxx-note>` accelerators, this
type of memory maps to, for example, “coarse-grained” ``hipHostMalloc``’d
memory -- that is, allocated with the ``hipHostMallocNonCoherent``
flag -- or ``hipMalloc``’d memory allocated on a remote accelerator.
@@ -134,15 +134,15 @@ These memory types include:
- Memory for which only reads from the accelerator where the memory was
allocated will be cached. Writes to CC memory are uncached, and trigger
invalidations of any line within this accelerator. On
- :ref:`MI2XX <mixxx-note>` accelerators, this type of memory maps to
+ :ref:`MI200 <mixxx-note>` accelerators, this type of memory maps to
“fine-grained” memory allocated on the local accelerator using, for
example, the ``hipExtMallocWithFlags`` API using the
``hipDeviceMallocFinegrained`` flag.

* - Read/Write Coherent Memory (RW)
- Memory that will be cached by the accelerator, but may be invalidated by
writes from remote devices at kernel boundaries / after software-driven
- synchronization events. On :ref:`MI2XX <mixxx-note>` accelerators, this
+ synchronization events. On :ref:`MI200 <mixxx-note>` accelerators, this
corresponds to “coarse-grained” memory allocated locally to the
accelerator, using for example, the default ``hipMalloc`` allocator.

147 changes: 85 additions & 62 deletions docs/conceptual/l2-cache.rst
@@ -7,19 +7,23 @@ L2 cache (TCC)
**************

The L2 cache is the coherence point for current AMD Instinct™ MI-series GCN™
- GPUs and CDNA™ accelerators, and is shared by all :doc:`CUs <compute-unit>`
- on the device. Besides serving requests from the
- :doc:`vector L1 data caches <vector-l1-cache>`, the L2 cache also is responsible
- for servicing requests from the :ref:`L1 instruction caches <desc-l1i>`, the
- :ref:`scalar L1 data caches <desc-sL1D>` and the
- :doc:`command processor <command-processor>`. The L2 cache is composed of a
- number of distinct channels (32 on MI100 and :ref:`MI2XX <mixxx-note>` series CDNA
- accelerators at 256B address interleaving) which can largely operate
- independently. Mapping of incoming requests to a specific L2 channel is
- determined by a hashing mechanism that attempts to evenly distribute requests
- across the L2 channels. Requests that miss in the L2 cache are passed out to
- :ref:`Infinity Fabric™ <l2-fabric>` to be routed to the appropriate memory
- location.
+ GPUs and CDNA™ accelerators, and is shared by all :doc:`CUs <compute-unit>` on
+ the device. Besides serving requests from the :doc:`vector L1 data caches
+ <vector-l1-cache>`, the L2 cache is also responsible for servicing requests
+ from the :ref:`L1 instruction caches <desc-l1i>`, the :ref:`scalar L1 data
+ caches <desc-sL1D>` and the :doc:`command processor <command-processor>`.
+
+ The L2 cache consists of several distinct channels. On the CDNA3-based
+ :ref:`MI300 <mixxx-note>` accelerator, each L2 instance has 16 channels, each
+ with a capacity of 256KB and 256B address interleaving, and a device contains
+ up to 8 such instances (*one per XCD*). These channels can operate largely
+ independently. In contrast, :ref:`MI200 <mixxx-note>` CDNA2 accelerators have
+ 32 L2 cache channels each using 256B address interleaving, and MI100 CDNA
+ accelerators and GCN GPUs have only 16 L2 cache channels. Incoming requests
+ are mapped to specific L2 channels using a hashing mechanism designed to
+ evenly distribute the requests across the available channels. Requests that
+ miss in the L2 cache are forwarded to :ref:`Infinity Fabric™ <l2-fabric>` to
+ be routed to the appropriate memory location. For more details, see
+ :cdna3-white-paper:`9`.
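
As a rough illustration of the interleaving, consecutive 256B blocks of an
address stream land on different channels. The sketch below uses a plain
modulo as a stand-in for the hardware's hashing mechanism, which is not
publicly specified:

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>

   // Simplified channel-mapping model: real hardware uses a hash designed to
   // spread requests evenly, not a plain modulo.
   uint32_t l2_channel(uint64_t addr, uint32_t num_channels) {
     const uint64_t interleave = 256;  // 256B address interleaving
     return static_cast<uint32_t>((addr / interleave) % num_channels);
   }

   int main() {
     // Four consecutive 256B blocks map to four different channels of a
     // 16-channel L2 instance (as on an MI300 XCD).
     for (uint64_t addr = 0; addr < 4 * 256; addr += 256) {
       printf("addr %4llu -> channel %u\n",
              static_cast<unsigned long long>(addr), l2_channel(addr, 16));
     }
     return 0;
   }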

The L2 cache metrics reported by ROCm Compute Profiler are broken down into four
categories:
@@ -168,7 +172,7 @@ This section details the incoming requests to the L2 cache from the

- The total number of incoming requests to the L2 that are marked as
*streaming*. The exact meaning of this may differ depending on the
- targeted accelerator, however on an :ref:`MI2XX <mixxx-note>` this
+ targeted accelerator; however, on an :ref:`MI200 <mixxx-note>` this
corresponds to
`non-temporal load or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_.
The L2 cache attempts to evict *streaming* requests before normal
@@ -179,7 +183,7 @@ This section details the incoming requests to the L2 cache from the
* - Probe Requests

- The number of coherence probe requests made to the L2 cache from outside
- the accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be
+ the accelerator. On an :ref:`MI200 <mixxx-note>`, probe requests may be
generated by, for example, writes to
:ref:`fine-grained device <memory-type>` memory or by writes to
:ref:`coarse-grained <memory-type>` device memory.
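
For the *streaming* requests described above, the sketch below marks a copy
kernel's accesses non-temporal using the Clang builtins linked earlier; the
kernel name and shapes are illustrative, and whether these builtins lower to
streaming requests depends on the target architecture:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Streaming copy: non-temporal accesses hint that the L2 may evict these
   // lines before normally fetched data.
   __global__ void stream_copy(const float* __restrict__ in,
                               float* __restrict__ out, size_t n) {
     size_t i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) {
       float v = __builtin_nontemporal_load(&in[i]);
       __builtin_nontemporal_store(v, &out[i]);
     }
   }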
@@ -280,12 +284,14 @@ This section details the incoming requests to the L2 cache from the

- Requests per :ref:`normalization unit <normalization-units>`.

+ .. _l2-cache-line-size:
+
.. note::

All requests to the L2 are for a single cache line's worth of data. The size
- of a cache line may vary depending on the accelerator, however on an AMD
- Instinct CDNA2 :ref:`MI2XX <mixxx-note>` accelerator, it is 128B, while on
- an MI100, it is 64B.
+ of a cache line may vary depending on the accelerator. The L2 cache line
+ size is 128B on :ref:`MI300 and MI200 <mixxx-note>` accelerators, while on
+ MI100, it is 64B.
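
As a quick worked example of what the line size implies, assuming a
unit-stride, line-aligned access with one 4B float per lane of a 64-lane
wavefront:

.. code-block:: cpp

   #include <cstdio>

   // Cache lines touched by one wavefront, assuming aligned unit-stride access.
   int lines_touched(int lanes, int bytes_per_lane, int line_bytes) {
     int total_bytes = lanes * bytes_per_lane;            // 64 * 4 = 256 bytes
     return (total_bytes + line_bytes - 1) / line_bytes;  // ceiling division
   }

   int main() {
     printf("MI300/MI200 (128B lines): %d\n", lines_touched(64, 4, 128));  // 2
     printf("MI100 (64B lines):        %d\n", lines_touched(64, 4, 64));   // 4
     return 0;
   }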

.. _l2-fabric:

@@ -308,8 +314,8 @@ individual metrics.
Request flow
------------

- The following is a diagram that illustrates how L2↔Fabric requests are reported
- by ROCm Compute Profiler:
+ The following diagram illustrates how L2↔Fabric requests are reported by ROCm
+ Compute Profiler:

.. figure:: ../data/performance-model/fabric.png
:align: center
@@ -318,14 +324,14 @@ by ROCm Compute Profiler:

L2↔Fabric transaction flow on AMD Instinct MI-series accelerators.


Requests from the L2 Cache are broken down into two major categories, read
requests and write requests (at this granularity, atomic requests are treated
as writes).

From there, these requests can additionally be subdivided in a number of ways.
First, these requests may be sent across Infinity Fabric as different
- transaction sizes, 32B or 64B on current CDNA accelerators.
+ transaction sizes: 32B, 64B, or 128B. Not all transaction sizes are supported
+ on all CDNA accelerators and GCN GPUs.
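
Because the fabric counters are per-size request counts, the total bytes moved
can be reconstructed by weighting each count by its transaction size. A sketch
with illustrative variable names (these are not profiler metric names):

.. code-block:: cpp

   #include <cstdint>

   // Bytes moved over Infinity Fabric, reconstructed from per-size request
   // counters; an accelerator that lacks a transaction size simply reports
   // zero for that counter.
   uint64_t fabric_bytes(uint64_t req32, uint64_t req64, uint64_t req128) {
     return 32 * req32 + 64 * req64 + 128 * req128;
   }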

.. note::

@@ -334,15 +340,20 @@ transaction sizes, 32B or 64B on current CDNA accelerators.

In addition, the read and write requests can be further categorized as:

- * Uncached read/write requests, for instance: for access to
-   :ref:`fine-grained memory <memory-type>`
+ * Uncached read/write requests: These occur, for instance, when accessing
+   :ref:`fine-grained memory <memory-type>`.

- * Atomic requests, for instance: for atomic updates to
-   :ref:`fine-grained memory <memory-type>`
+ * Atomic requests (see the sketch after this list):
+
+   * On MI300 accelerators, all atomic requests are counted as such since they
+     bypass the L2 cache and are routed directly to the Infinity Cache (MALL).
+
+   * On MI200 accelerators, these are requests targeted at non-write-cacheable
+     memory, such as :ref:`fine-grained memory <memory-type>`.

- * HBM read/write requests OR remote read/write requests, for instance: for
-   requests to the accelerator’s local HBM OR requests to a remote accelerator’s
-   HBM or the CPU’s DRAM
+ * HBM or remote read/write requests: These are for requests to the
+   accelerator’s local high-bandwidth memory, or for requests to a remote
+   accelerator’s HBM or the CPU’s DRAM.
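
To make the atomic category concrete, the sketch below shows a device-side
atomic update to a fine-grained host allocation. On an MI200 this would be
counted as an Infinity Fabric atomic because the target is not
write-cacheable, and on an MI300 it would be counted regardless of memory
type. Names and the host-side setup are illustrative:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Each flagged thread atomically increments a shared counter.
   __global__ void count_events(int* counter, const int* flags, size_t n) {
     size_t i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n && flags[i]) {
       atomicAdd(counter, 1);  // atomic update to fine-grained memory
     }
   }

   // Host-side setup (sketch): a fine-grained, coherent host allocation, so
   // the atomicAdd above targets non-write-cacheable memory from the
   // device's point of view.
   //   int* counter;
   //   hipHostMalloc(&counter, sizeof(int), hipHostMallocDefault);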

These classifications are not necessarily *exclusive*. For example, a
write request can be classified as an atomic request to the
@@ -355,7 +366,6 @@ flow splits at this point:
.. figure:: ../data/performance-model/split.*
:align: center
:alt: Splitting request flow
- :width: 800

Splitting request flow

@@ -365,7 +375,6 @@ uncached write request, as reflected by a non-split flow:
.. figure:: ../data/performance-model/nosplit.*
:align: center
:alt: Non-splitting request flow
- :width: 800

Non-splitting request flow

@@ -379,7 +388,6 @@ counted as *two* uncached read requests (that is, the request is split):
.. figure:: ../data/performance-model/uncached.*
:align: center
:alt: Uncached read-request splitting
- :width: 800

Uncached read-request splitting.
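
A hedged model of the splitting rule pictured above: when an uncached read
covers a full cache line but is carried by smaller fabric transactions, each
transaction increments the uncached-read counter once (the 128B/64B values
below are an assumption based on the MI200 line size):

.. code-block:: cpp

   #include <cstdint>

   // Requests counted for one uncached cache-line read split into smaller
   // fabric transactions, e.g. a 128B line carried as two 64B reads.
   uint32_t uncached_read_requests(uint32_t line_bytes, uint32_t txn_bytes) {
     return line_bytes / txn_bytes;  // 128 / 64 = 2 requests
   }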

@@ -388,7 +396,7 @@ counted as *two* uncached read requests (that is, the request is split):
Metrics
-------

The following metrics are reported for the L2-Fabric interface:

.. list-table::
:header-rows: 1
@@ -444,15 +452,16 @@ Metrics

* - L2-Fabric Write and Atomic Bandwidth

- - The total number of bytes written by the L2 over Infinity Fabric by write
-   and atomic operations per
-   :ref:`normalization unit <normalization-units>`. Note that on current
-   CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are
-   only considered *atomic* by Infinity Fabric if they are targeted at
-   non-write-cacheable memory, for example,
-   :ref:`fine-grained memory <memory-type>` allocations or
-   :ref:`uncached memory <memory-type>` allocations on the
-   MI2XX.
+ - The total number of bytes written by the L2 over Infinity Fabric by
+   write and atomic operations per :ref:`normalization unit
+   <normalization-units>`. Note that on :ref:`MI200 <mixxx-note>`
+   accelerators, requests are only considered *atomic* by Infinity Fabric
+   if they are targeted at non-write-cacheable memory, for example,
+   :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
+   memory <memory-type>` allocations on the MI200. However, on the MI300,
+   all atomic requests are counted as such because they are not cached in
+   L2 and must be directed to the Infinity Cache (MALL), regardless of the
+   memory type.

- Bytes per :ref:`normalization unit <normalization-units>`.

@@ -463,11 +472,13 @@ Metrics
breakdown does not consider the *size* of the request (meaning that 32B
and 64B requests are both counted as a single request), so this metric
only *approximates* the percent of the L2-Fabric Write and Atomic
- bandwidth directed to the local HBM. Note that on current CDNA
- accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
- considered *atomic* by Infinity Fabric if they are targeted at
- :ref:`fine-grained memory <memory-type>` allocations or
- :ref:`uncached memory <memory-type>` allocations.
+ bandwidth directed to the local HBM. Note that on :ref:`MI200
+ <mixxx-note>` accelerators, requests are only considered *atomic* by
+ Infinity Fabric if they are targeted at :ref:`fine-grained memory
+ <memory-type>` allocations or :ref:`uncached memory <memory-type>`
+ allocations. However, on the MI300, all atomic requests are counted as
+ such because they are not cached in L2 and must be directed to the
+ Infinity Cache (MALL), regardless of the memory type.

- Percent

@@ -479,11 +490,13 @@ Metrics
HBM. This breakdown does not consider the *size* of the request (meaning
that 32B and 64B requests are both counted as a single request), so this
metric only *approximates* the percent of the L2-Fabric Read bandwidth
- directed to a remote location. Note that on current CDNA
- accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
- considered *atomic* by Infinity Fabric if they are targeted at
- :ref:`fine-grained memory <memory-type>` allocations or
- :ref:`uncached memory <memory-type>` allocations.
+ directed to a remote location. Note that on :ref:`MI200 <mixxx-note>`
+ accelerators, requests are only considered *atomic* by Infinity Fabric
+ if they are targeted at :ref:`fine-grained memory <memory-type>`
+ allocations or :ref:`uncached memory <memory-type>` allocations.
+ However, on the MI300, all atomic requests are counted as such because
+ they are not cached in L2 and must be directed to the Infinity Cache
+ (MALL), regardless of the memory type.

- Percent

@@ -494,10 +507,12 @@ Metrics
*size* of the request (meaning that 32B and 64B requests are both counted
as a single request), so this metric only *approximates* the percent of
the L2-Fabric Read bandwidth directed to a remote location. Note that on
- current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
- requests are only considered *atomic* by Infinity Fabric if they are
- targeted at :ref:`fine-grained memory <memory-type>` allocations or
- :ref:`uncached memory <memory-type>` allocations.
+ :ref:`MI200 <mixxx-note>` accelerators, requests are only considered
+ *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained
+ memory <memory-type>` allocations or :ref:`uncached memory
+ <memory-type>` allocations. However, on the MI300, all atomic requests
+ are counted as such because they are not cached in L2 and must be
+ directed to the Infinity Cache (MALL), regardless of the memory type.

- Percent

@@ -600,6 +615,15 @@ transaction breakdown table:

- Requests per :ref:`normalization unit <normalization-units>`.

+ * - 128B Read Requests
+
+   - The total number of L2 requests to Infinity Fabric to read 128B of data
+     from any memory location, per
+     :ref:`normalization unit <normalization-units>`. See
+     :ref:`l2-request-flow` for more detail.
+
+   - Requests per :ref:`normalization unit <normalization-units>`.
+
* - HBM Read Requests

- The total number of L2 requests to Infinity Fabric to read 32B or 64B of
@@ -669,12 +693,11 @@ transaction breakdown table:
- The total number of L2 requests to Infinity Fabric to atomically update
32B or 64B of data in any memory location, per
:ref:`normalization unit <normalization-units>`. See
- :ref:`l2-request-flow` for more detail. Note that on current CDNA
- accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
- considered *atomic* by Infinity Fabric if they are targeted at
- non-write-cacheable memory, such as
- :ref:`fine-grained memory <memory-type>` allocations or
- :ref:`uncached memory <memory-type>` allocations on the MI2XX.
+ :ref:`l2-request-flow` for more detail. Note that on :ref:`MI200
+ <mixxx-note>` accelerators, requests are only considered *atomic* by
+ Infinity Fabric if they are targeted at non-write-cacheable memory, such
+ as :ref:`fine-grained memory <memory-type>` allocations or
+ :ref:`uncached memory <memory-type>` allocations on the MI200.

- Requests per :ref:`normalization unit <normalization-units>`.

36 changes: 24 additions & 12 deletions docs/conceptual/performance-model.rst
@@ -7,11 +7,11 @@
Performance model
*****************

- ROCm Compute Profiler makes available an extensive list of metrics to better understand
- achieved application performance on AMD Instinct™ MI-series accelerators
- including Graphics Core Next™ (GCN) GPUs like the AMD Instinct MI50, CDNA
- accelerators like the MI100, and CDNA2 accelerators such as the MI250X, MI250,
- and MI210.
+ ROCm Compute Profiler makes available an extensive list of metrics to better
+ understand achieved application performance on AMD Instinct™ MI-series
+ accelerators including CDNA™ 3 accelerators like the MI300, CDNA 2 accelerators
+ like the MI250X, MI250, and MI210, CDNA accelerators like the MI100, and
+ Graphics Core Next™ (GCN) GPUs like the MI50.

To best use profiling data, it's important to understand the role of various
hardware blocks of AMD Instinct accelerators. This section describes each
@@ -24,15 +24,27 @@ to use ROCm Compute Profiler to optimize your code.

.. note::

- In this chapter, **MI2XX** refers to any of the CDNA2 architecture-based AMD
- Instinct MI250X, MI250, and MI210 accelerators interchangeably in cases
- where the exact product at hand is not relevant.
+ In this chapter, **MI300** refers to any of the CDNA3 architecture-based AMD
+ Instinct MI300X and MI300A accelerators interchangeably in cases where the
+ exact product at hand is not relevant.
+
+ Likewise, **MI200** refers to any of the CDNA2 architecture-based AMD
+ Instinct MI250X, MI250, and MI210 accelerators interchangeably.

For a comparison of AMD Instinct accelerator specifications, refer to
- :doc:`Hardware specifications <rocm:reference/gpu-arch-specs>`. For product
- details, see the :prod-page:`MI250X <mi200/mi250x>`,
- :prod-page:`MI250 <mi200/mi250>`, and :prod-page:`MI210 <mi200/mi210>`
- product pages.
+ :doc:`Hardware specifications <rocm:reference/gpu-arch-specs>`.

+ Supported features
+ ==================
+
+ .. list-table::
+    :header-rows: 1
+
+    * - Feature
Contributor comment: Table seems weird with just one entry right now, but I'm sure we had ideas on how to fill it :P

Contributor comment: Oh, right, we probably want to (eventually) take a second pass and go through to find places where we distinguish values based on the architecture, like the waveslots discussion below (or AGPRs), and add them here.

That can probably wait till this is ~ finalized though.

+      - Support
+
+    * - Infinity Cache
+      - MI300

In this chapter, the AMD Instinct performance model used by ROCm Compute Profiler is divided into a handful of
key hardware blocks, each detailed in the following sections: