Add MI300 details to docs #446

Draft · wants to merge 22 commits into base: develop
10 changes: 5 additions & 5 deletions docs/conceptual/definitions.rst
@@ -102,7 +102,7 @@ generations, but may differ in exact implementation.
In addition, these memory types *might* differ between accelerators on the same
system, even when accessing the same memory allocation.

- For example, an :ref:`MI2XX <mixxx-note>` accelerator accessing *fine-grained*
+ For example, an :ref:`MI200 <mixxx-note>` accelerator accessing *fine-grained*
memory allocated local to that device may see the allocation as coherently
cacheable, while a remote accelerator might see the same allocation as
*uncached*.
@@ -117,15 +117,15 @@ These memory types include:

* - Uncached Memory (UC)
- Memory that will not be cached in this accelerator. On
- :ref:`MI2XX <mixxx-note>` accelerators, this corresponds “fine-grained”
+ :ref:`MI200 <mixxx-note>` accelerators, this corresponds to “fine-grained”
(or, “coherent”) memory allocated on a remote accelerator or the host,
for example, using ``hipHostMalloc`` or ``hipMallocManaged`` with default
allocation flags.

* - Non-hardware-Coherent Memory (NC)
- Memory that will be cached by the accelerator, and is only guaranteed to
be consistent at kernel boundaries / after software-driven
- synchronization events. On :ref:`MI2XX <mixxx-note>` accelerators, this
+ synchronization events. On :ref:`MI200 <mixxx-note>` accelerators, this
type of memory maps to, for example, “coarse-grained” ``hipHostMalloc``’d
memory -- that is, allocated with the ``hipHostMallocNonCoherent``
flag -- or ``hipMalloc``’d memory allocated on a remote accelerator.
@@ -134,15 +134,15 @@ These memory types include:
- Memory for which only reads from the accelerator where the memory was
allocated will be cached. Writes to CC memory are uncached, and trigger
invalidations of any line within this accelerator. On
- :ref:`MI2XX <mixxx-note>` accelerators, this type of memory maps to
+ :ref:`MI200 <mixxx-note>` accelerators, this type of memory maps to
“fine-grained” memory allocated on the local accelerator using, for
example, the ``hipExtMallocWithFlags`` API using the
``hipDeviceMallocFinegrained`` flag.

* - Read/Write Coherent Memory (RW)
- Memory that will be cached by the accelerator, but may be invalidated by
writes from remote devices at kernel boundaries / after software-driven
- synchronization events. On :ref:`MI2XX <mixxx-note>` accelerators, this
+ synchronization events. On :ref:`MI200 <mixxx-note>` accelerators, this
corresponds to “coarse-grained” memory allocated locally to the
accelerator, using for example, the default ``hipMalloc`` allocator.

147 changes: 85 additions & 62 deletions docs/conceptual/l2-cache.rst
@@ -7,19 +7,23 @@ L2 cache (TCC)
**************

The L2 cache is the coherence point for current AMD Instinct™ MI-series GCN™
- GPUs and CDNA™ accelerators, and is shared by all :doc:`CUs <compute-unit>`
- on the device. Besides serving requests from the
- :doc:`vector L1 data caches <vector-l1-cache>`, the L2 cache also is responsible
- for servicing requests from the :ref:`L1 instruction caches <desc-l1i>`, the
- :ref:`scalar L1 data caches <desc-sL1D>` and the
- :doc:`command processor <command-processor>`. The L2 cache is composed of a
- number of distinct channels (32 on MI100 and :ref:`MI2XX <mixxx-note>` series CDNA
- accelerators at 256B address interleaving) which can largely operate
- independently. Mapping of incoming requests to a specific L2 channel is
- determined by a hashing mechanism that attempts to evenly distribute requests
- across the L2 channels. Requests that miss in the L2 cache are passed out to
- :ref:`Infinity Fabric™ <l2-fabric>` to be routed to the appropriate memory
- location.
+ GPUs and CDNA™ accelerators, and is shared by all :doc:`CUs <compute-unit>` on
+ the device. Besides serving requests from the :doc:`vector L1 data caches
+ <vector-l1-cache>`, the L2 cache is also responsible for servicing requests
+ from the :ref:`L1 instruction caches <desc-l1i>`, the :ref:`scalar L1 data
+ caches <desc-sL1D>` and the :doc:`command processor <command-processor>`.
+
+ The L2 cache consists of several distinct channels. On the CDNA3-based
+ :ref:`MI300 <mixxx-note>` accelerator, each L2 instance has 16 channels, each
+ with a capacity of 256KB and 256B address interleaving, and a device contains
+ up to 8 such instances (*one per XCD*). These channels can operate largely
+ independently. In contrast, :ref:`MI200 <mixxx-note>` CDNA2 accelerators have
+ 32 L2 cache channels each using 256B address interleaving, and MI100 CDNA
+ accelerators and GCN GPUs have only 16 L2 cache channels. Incoming requests
+ are mapped to specific L2 channels using a hashing mechanism designed to
+ evenly distribute the requests across the available channels. Requests that
+ miss in the L2 cache are forwarded to :ref:`Infinity Fabric™ <l2-fabric>` to
+ be routed to the appropriate memory location. For more details, see
+ :cdna3-white-paper:`9`.
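
As a rough illustration of the interleaving, consecutive 256B blocks of an
address stream land on different channels. The sketch below uses a plain
modulo as a stand-in for the hardware's hashing mechanism, which is not
publicly specified:

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>

   // Simplified channel-mapping model: real hardware uses a hash designed to
   // spread requests evenly, not a plain modulo.
   uint32_t l2_channel(uint64_t addr, uint32_t num_channels) {
     const uint64_t interleave = 256;  // 256B address interleaving
     return static_cast<uint32_t>((addr / interleave) % num_channels);
   }

   int main() {
     // Four consecutive 256B blocks map to four different channels of a
     // 16-channel L2 instance (as on an MI300 XCD).
     for (uint64_t addr = 0; addr < 4 * 256; addr += 256) {
       printf("addr %4llu -> channel %u\n",
              static_cast<unsigned long long>(addr), l2_channel(addr, 16));
     }
     return 0;
   }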

The L2 cache metrics reported by ROCm Compute Profiler are broken down into four
categories:
@@ -168,7 +172,7 @@ This section details the incoming requests to the L2 cache from the

- The total number of incoming requests to the L2 that are marked as
*streaming*. The exact meaning of this may differ depending on the
- targeted accelerator, however on an :ref:`MI2XX <mixxx-note>` this
+ targeted accelerator; however, on an :ref:`MI200 <mixxx-note>` this
corresponds to
`non-temporal load or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_.
The L2 cache attempts to evict *streaming* requests before normal
@@ -179,7 +183,7 @@ This section details the incoming requests to the L2 cache from the
* - Probe Requests

- The number of coherence probe requests made to the L2 cache from outside
- the accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be
+ the accelerator. On an :ref:`MI200 <mixxx-note>`, probe requests may be
generated by, for example, writes to
:ref:`fine-grained device <memory-type>` memory or by writes to
:ref:`coarse-grained <memory-type>` device memory.
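
For the *streaming* requests described above, the sketch below marks a copy
kernel's accesses non-temporal using the Clang builtins linked earlier; the
kernel name and shapes are illustrative, and whether these builtins lower to
streaming requests depends on the target architecture:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Streaming copy: non-temporal accesses hint that the L2 may evict these
   // lines before normally fetched data.
   __global__ void stream_copy(const float* __restrict__ in,
                               float* __restrict__ out, size_t n) {
     size_t i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) {
       float v = __builtin_nontemporal_load(&in[i]);
       __builtin_nontemporal_store(v, &out[i]);
     }
   }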
@@ -280,12 +284,14 @@ This section details the incoming requests to the L2 cache from the

- Requests per :ref:`normalization unit <normalization-units>`.

+ .. _l2-cache-line-size:
+
.. note::

All requests to the L2 are for a single cache line's worth of data. The size
- of a cache line may vary depending on the accelerator, however on an AMD
- Instinct CDNA2 :ref:`MI2XX <mixxx-note>` accelerator, it is 128B, while on
- an MI100, it is 64B.
+ of a cache line may vary depending on the accelerator. The L2 cache line
+ size is 128B on :ref:`MI300 and MI200 <mixxx-note>` accelerators, while on
+ MI100, it is 64B.
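
As a quick worked example of what the line size implies, assuming a
unit-stride, line-aligned access with one 4B float per lane of a 64-lane
wavefront:

.. code-block:: cpp

   #include <cstdio>

   // Cache lines touched by one wavefront, assuming aligned unit-stride access.
   int lines_touched(int lanes, int bytes_per_lane, int line_bytes) {
     int total_bytes = lanes * bytes_per_lane;            // 64 * 4 = 256 bytes
     return (total_bytes + line_bytes - 1) / line_bytes;  // ceiling division
   }

   int main() {
     printf("MI300/MI200 (128B lines): %d\n", lines_touched(64, 4, 128));  // 2
     printf("MI100 (64B lines):        %d\n", lines_touched(64, 4, 64));   // 4
     return 0;
   }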

.. _l2-fabric:

@@ -308,8 +314,8 @@ individual metrics.
Request flow
------------

- The following is a diagram that illustrates how L2↔Fabric requests are reported
- by ROCm Compute Profiler:
+ The following diagram illustrates how L2↔Fabric requests are reported by ROCm
+ Compute Profiler:

.. figure:: ../data/performance-model/fabric.png
:align: center
@@ -318,14 +324,14 @@ by ROCm Compute Profiler:

L2↔Fabric transaction flow on AMD Instinct MI-series accelerators.


Requests from the L2 Cache are broken down into two major categories, read
requests and write requests (at this granularity, atomic requests are treated
as writes).

From there, these requests can additionally be subdivided in a number of ways.
First, these requests may be sent across Infinity Fabric as different
- transaction sizes, 32B or 64B on current CDNA accelerators.
+ transaction sizes: 32B, 64B, or 128B. Not all transaction sizes are supported
+ on all CDNA accelerators and GCN GPUs.
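
Because the fabric counters are per-size request counts, the total bytes moved
can be reconstructed by weighting each count by its transaction size. A sketch
with illustrative variable names (these are not profiler metric names):

.. code-block:: cpp

   #include <cstdint>

   // Bytes moved over Infinity Fabric, reconstructed from per-size request
   // counters; an accelerator that lacks a transaction size simply reports
   // zero for that counter.
   uint64_t fabric_bytes(uint64_t req32, uint64_t req64, uint64_t req128) {
     return 32 * req32 + 64 * req64 + 128 * req128;
   }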

.. note::

@@ -334,15 +340,20 @@ transaction sizes, 32B or 64B on current CDNA accelerators.

In addition, the read and write requests can be further categorized as:

- * Uncached read/write requests, for instance: for access to
-   :ref:`fine-grained memory <memory-type>`
+ * Uncached read/write requests: These occur, for instance, when accessing
+   :ref:`fine-grained memory <memory-type>`.

- * Atomic requests, for instance: for atomic updates to
-   :ref:`fine-grained memory <memory-type>`
+ * Atomic requests (see the sketch after this list):
+
+   * On MI300 accelerators, all atomic requests are counted as such since they
+     bypass the L2 cache and are routed directly to the Infinity Cache (MALL).
+
+   * On MI200 accelerators, these are requests targeted at non-write-cacheable
+     memory, such as :ref:`fine-grained memory <memory-type>`.

- * HBM read/write requests OR remote read/write requests, for instance: for
-   requests to the accelerator’s local HBM OR requests to a remote accelerator’s
-   HBM or the CPU’s DRAM
+ * HBM or remote read/write requests: These are for requests to the
+   accelerator’s local high-bandwidth memory, or for requests to a remote
+   accelerator’s HBM or the CPU’s DRAM.
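
To make the atomic category concrete, the sketch below shows a device-side
atomic update to a fine-grained host allocation. On an MI200 this would be
counted as an Infinity Fabric atomic because the target is not
write-cacheable, and on an MI300 it would be counted regardless of memory
type. Names and the host-side setup are illustrative:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Each flagged thread atomically increments a shared counter.
   __global__ void count_events(int* counter, const int* flags, size_t n) {
     size_t i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n && flags[i]) {
       atomicAdd(counter, 1);  // atomic update to fine-grained memory
     }
   }

   // Host-side setup (sketch): a fine-grained, coherent host allocation, so
   // the atomicAdd above targets non-write-cacheable memory from the
   // device's point of view.
   //   int* counter;
   //   hipHostMalloc(&counter, sizeof(int), hipHostMallocDefault);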

These classifications are not necessarily *exclusive*. For example, a
write request can be classified as an atomic request to the
@@ -355,7 +366,6 @@ flow splits at this point:
.. figure:: ../data/performance-model/split.*
:align: center
:alt: Splitting request flow
- :width: 800

Splitting request flow

@@ -365,7 +375,6 @@ uncached write request, as reflected by a non-split flow:
.. figure:: ../data/performance-model/nosplit.*
:align: center
:alt: Non-splitting request flow
- :width: 800

Non-splitting request flow

@@ -379,7 +388,6 @@ counted as *two* uncached read requests (that is, the request is split):
.. figure:: ../data/performance-model/uncached.*
:align: center
:alt: Uncached read-request splitting
- :width: 800

Uncached read-request splitting.
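
A hedged model of the splitting rule pictured above: when an uncached read
covers a full cache line but is carried by smaller fabric transactions, each
transaction increments the uncached-read counter once (the 128B/64B values
below are an assumption based on the MI200 line size):

.. code-block:: cpp

   #include <cstdint>

   // Requests counted for one uncached cache-line read split into smaller
   // fabric transactions, e.g. a 128B line carried as two 64B reads.
   uint32_t uncached_read_requests(uint32_t line_bytes, uint32_t txn_bytes) {
     return line_bytes / txn_bytes;  // 128 / 64 = 2 requests
   }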

@@ -388,7 +396,7 @@ counted as *two* uncached read requests (that is, the request is split):
Metrics
-------

The following metrics are reported for the L2-Fabric interface:

.. list-table::
:header-rows: 1
@@ -444,15 +452,16 @@ Metrics

* - L2-Fabric Write and Atomic Bandwidth

- - The total number of bytes written by the L2 over Infinity Fabric by write
-   and atomic operations per
-   :ref:`normalization unit <normalization-units>`. Note that on current
-   CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are
-   only considered *atomic* by Infinity Fabric if they are targeted at
-   non-write-cacheable memory, for example,
-   :ref:`fine-grained memory <memory-type>` allocations or
-   :ref:`uncached memory <memory-type>` allocations on the
-   MI2XX.
+ - The total number of bytes written by the L2 over Infinity Fabric by
+   write and atomic operations per :ref:`normalization unit
+   <normalization-units>`. Note that on :ref:`MI200 <mixxx-note>`
+   accelerators, requests are only considered *atomic* by Infinity Fabric
+   if they are targeted at non-write-cacheable memory, for example,
+   :ref:`fine-grained memory <memory-type>` allocations or :ref:`uncached
+   memory <memory-type>` allocations on the MI200. However, on the MI300,
+   all atomic requests are counted as such because they are not cached in
+   L2 and must be directed to the Infinity Cache (MALL), regardless of the
+   memory type.

- Bytes per :ref:`normalization unit <normalization-units>`.

@@ -463,11 +472,13 @@ Metrics
breakdown does not consider the *size* of the request (meaning that 32B
and 64B requests are both counted as a single request), so this metric
only *approximates* the percent of the L2-Fabric Write and Atomic
- bandwidth directed to the local HBM. Note that on current CDNA
- accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
- considered *atomic* by Infinity Fabric if they are targeted at
- :ref:`fine-grained memory <memory-type>` allocations or
- :ref:`uncached memory <memory-type>` allocations.
+ bandwidth directed to the local HBM. Note that on :ref:`MI200
+ <mixxx-note>` accelerators, requests are only considered *atomic* by
+ Infinity Fabric if they are targeted at :ref:`fine-grained memory
+ <memory-type>` allocations or :ref:`uncached memory <memory-type>`
+ allocations. However, on the MI300, all atomic requests are counted as
+ such because they are not cached in L2 and must be directed to the
+ Infinity Cache (MALL), regardless of the memory type.

- Percent

@@ -479,11 +490,13 @@ Metrics
HBM. This breakdown does not consider the *size* of the request (meaning
that 32B and 64B requests are both counted as a single request), so this
metric only *approximates* the percent of the L2-Fabric Read bandwidth
- directed to a remote location. Note that on current CDNA
- accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
- considered *atomic* by Infinity Fabric if they are targeted at
- :ref:`fine-grained memory <memory-type>` allocations or
- :ref:`uncached memory <memory-type>` allocations.
+ directed to a remote location. Note that on :ref:`MI200 <mixxx-note>`
+ accelerators, requests are only considered *atomic* by Infinity Fabric
+ if they are targeted at :ref:`fine-grained memory <memory-type>`
+ allocations or :ref:`uncached memory <memory-type>` allocations.
+ However, on the MI300, all atomic requests are counted as such because
+ they are not cached in L2 and must be directed to the Infinity Cache
+ (MALL), regardless of the memory type.

- Percent

@@ -494,10 +507,12 @@ Metrics
*size* of the request (meaning that 32B and 64B requests are both counted
as a single request), so this metric only *approximates* the percent of
the L2-Fabric Read bandwidth directed to a remote location. Note that on
- current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
- requests are only considered *atomic* by Infinity Fabric if they are
- targeted at :ref:`fine-grained memory <memory-type>` allocations or
- :ref:`uncached memory <memory-type>` allocations.
+ :ref:`MI200 <mixxx-note>` accelerators, requests are only considered
+ *atomic* by Infinity Fabric if they are targeted at :ref:`fine-grained
+ memory <memory-type>` allocations or :ref:`uncached memory
+ <memory-type>` allocations. However, on the MI300, all atomic requests
+ are counted as such because they are not cached in L2 and must be
+ directed to the Infinity Cache (MALL), regardless of the memory type.

- Percent

@@ -600,6 +615,15 @@ transaction breakdown table:

- Requests per :ref:`normalization unit <normalization-units>`.

+ * - 128B Read Requests
+
+   - The total number of L2 requests to Infinity Fabric to read 128B of data
+     from any memory location, per
+     :ref:`normalization unit <normalization-units>`. See
+     :ref:`l2-request-flow` for more detail.
+
+   - Requests per :ref:`normalization unit <normalization-units>`.
+
* - HBM Read Requests

- The total number of L2 requests to Infinity Fabric to read 32B or 64B of
@@ -669,12 +693,11 @@ transaction breakdown table:
- The total number of L2 requests to Infinity Fabric to atomically update
32B or 64B of data in any memory location, per
:ref:`normalization unit <normalization-units>`. See
- :ref:`l2-request-flow` for more detail. Note that on current CDNA
- accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
- considered *atomic* by Infinity Fabric if they are targeted at
- non-write-cacheable memory, such as
- :ref:`fine-grained memory <memory-type>` allocations or
- :ref:`uncached memory <memory-type>` allocations on the MI2XX.
+ :ref:`l2-request-flow` for more detail. Note that on :ref:`MI200
+ <mixxx-note>` accelerators, requests are only considered *atomic* by
+ Infinity Fabric if they are targeted at non-write-cacheable memory, such
+ as :ref:`fine-grained memory <memory-type>` allocations or
+ :ref:`uncached memory <memory-type>` allocations on the MI200.

- Requests per :ref:`normalization unit <normalization-units>`.

36 changes: 24 additions & 12 deletions docs/conceptual/performance-model.rst
@@ -7,11 +7,11 @@
Performance model
*****************

- ROCm Compute Profiler makes available an extensive list of metrics to better understand
- achieved application performance on AMD Instinct™ MI-series accelerators
- including Graphics Core Next™ (GCN) GPUs like the AMD Instinct MI50, CDNA
- accelerators like the MI100, and CDNA2 accelerators such as the MI250X, MI250,
- and MI210.
+ ROCm Compute Profiler makes available an extensive list of metrics to better
+ understand achieved application performance on AMD Instinct™ MI-series
+ accelerators including CDNA™ 3 accelerators like the MI300, CDNA 2 accelerators
+ like the MI250X, MI250, and MI210, CDNA accelerators like the MI100, and
+ Graphics Core Next™ (GCN) GPUs like the MI50.

To best use profiling data, it's important to understand the role of various
hardware blocks of AMD Instinct accelerators. This section describes each
@@ -24,15 +24,27 @@ to use ROCm Compute Profiler to optimize your code.

.. note::

- In this chapter, **MI2XX** refers to any of the CDNA2 architecture-based AMD
- Instinct MI250X, MI250, and MI210 accelerators interchangeably in cases
- where the exact product at hand is not relevant.
+ In this chapter, **MI300** refers to any of the CDNA3 architecture-based AMD
+ Instinct MI300X and MI300A accelerators interchangeably in cases where the
+ exact product at hand is not relevant.
+
+ Likewise, **MI200** refers to any of the CDNA2 architecture-based AMD
+ Instinct MI250X, MI250, and MI210 accelerators interchangeably.

For a comparison of AMD Instinct accelerator specifications, refer to
- :doc:`Hardware specifications <rocm:reference/gpu-arch-specs>`. For product
- details, see the :prod-page:`MI250X <mi200/mi250x>`,
- :prod-page:`MI250 <mi200/mi250>`, and :prod-page:`MI210 <mi200/mi210>`
- product pages.
+ :doc:`Hardware specifications <rocm:reference/gpu-arch-specs>`.

+ Supported features
+ ==================
+
+ .. list-table::
+    :header-rows: 1
+
+    * - Feature
Contributor comment: Table seems weird with just one entry right now, but I'm sure we had ideas on how to fill it :P

Contributor comment: Oh, right, we probably want to (eventually) take a second pass and go through to find places where we distinguish values based on the architecture, like the waveslots discussion below (or AGPRs), and add them here.

That can probably wait till this is ~ finalized though.

+      - Support
+
+    * - Infinity Cache
+      - MI300

In this chapter, the AMD Instinct performance model used by ROCm Compute Profiler is divided into a handful of
key hardware blocks, each detailed in the following sections: