From 9b8e59515df8e018c810b80f38bdc255f6c7662f Mon Sep 17 00:00:00 2001
From: Taylor Childers
Date: Tue, 9 Aug 2022 09:27:34 -0500
Subject: [PATCH] updating introductory material

---
 00_introToAlcf/00_computeSystems.md | 90 +++++++++++++++++++++--------
 1 file changed, 65 insertions(+), 25 deletions(-)

diff --git a/00_introToAlcf/00_computeSystems.md b/00_introToAlcf/00_computeSystems.md
index 2737ee75a..4736eb3b2 100644
--- a/00_introToAlcf/00_computeSystems.md
+++ b/00_introToAlcf/00_computeSystems.md
@@ -1,43 +1,83 @@
-# Argonne Leadership Computing Facility (ALCF) Overview
-[Overview of ALCF](img/ALCF-AITraining2021_dist.pdf)
+# What is a Supercomputer?

-# Computing System Overview
+Argonne hosts DOE supercomputers for use by research scientists in need of large computational resources. Supercomputers are composed of many computing _nodes_ (1 _node_ = 1 physical computer) that are connected by a high-speed communications network so that groups of nodes can share information quickly, effectively operating together as one larger computer.

-[Video presentation of our System Introduction](https://www.alcf.anl.gov/support-center/training-assets/getting-started-theta)
+# A Compute Node

-## Theta
+If you look inside your desktop or laptop computer, you'll find these parts:
+
+![parts](img/computer-parts-diagram.png)
+
+A compute node of a supercomputer is very similar and has the same kinds of parts, but it is designed as a single unit that can be inserted into, and removed from, large closet-sized racks alongside many other nodes:
+
+![blade](img/computer_blade.jpg)
+
+In large supercomputers, multiple computer processors (CPUs) and/or graphics processors (GPUs) are combined into a single node. Each node has a CPU on which the local operating system runs, local memory for running software, and possibly GPUs for doing intensive calculations. Each node also has a high-speed network connection that allows it to communicate with other nodes and with a large shared filesystem.
+
+# Cluster/HPC Computing Hardware Setup
+
+![Hardware](img/supercomputer_diagram.png)
+
+Large computer systems typically have _worker_ nodes and _login_ nodes. _Login_ nodes are the nodes every user arrives on when they log in to the system. _Login_ nodes should not be used for computation; they are for compiling code, writing/editing code, and launching _jobs_ on the system. A _job_ is the application that will be launched on the _worker_ nodes of the supercomputer.
+
+# Supercomputers are Big!
+
+These supercomputers occupy a lot of space in the ALCF data center. Here is our staff at the time (2019) in front of [Mira](https://en.wikipedia.org/wiki/Mira_(supercomputer)), an IBM supercomputer that debuted as the third-fastest supercomputer in the world in 2012.
+
+![Staff-Mira](img/mira_staff.jpg)
+
+
+# ALCF Computing System Overview
+
+## [Theta](https://www.alcf.anl.gov/alcf-resources/theta)

 ![Theta](https://www.alcf.anl.gov/sites/default/files/styles/965x543/public/2019-10/09_ALCF-Theta_111016_rgb.jpg?itok=lcvZKE6k)

+The decals are stickers attached to the front of the computer _racks_ (each roughly the size of a closet). If you look inside a single rack you will see rows of computer _nodes_. Notice the repeating rows inside the racks: these machines are designed so that individual _nodes_ can be swapped out and replaced when needed.
+
+![Theta-nodes](img/theta1.jpg)
+
 Theta is an 11.7-petaflops supercomputer based on Intel processors and interconnect technology, an advanced memory architecture, and a Lustre-based parallel file system, all integrated by Cray’s HPC software stack.
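+Theta's thousands of nodes are used by running many copies of a program at once, one or more copies per node, which coordinate over the high-speed network described above. As a loose illustration (not an ALCF-specific recipe), here is a minimal sketch using the `mpi4py` package, assuming it is installed on the system: each copy (called a _rank_) reports which node it landed on and receives a value from rank 0 over the network.
+
+```python
+# Minimal sketch: every MPI "rank" is one copy of this script, typically
+# spread across many nodes. Assumes the mpi4py package is available.
+from mpi4py import MPI
+import socket
+
+comm = MPI.COMM_WORLD        # all of the ranks launched together
+rank = comm.Get_rank()       # this copy's ID: 0, 1, 2, ...
+size = comm.Get_size()       # total number of copies
+
+# Each rank reports which node it is running on.
+print(f"rank {rank} of {size} is running on node {socket.gethostname()}")
+
+# Rank 0 shares a value with every other rank over the network.
+data = comm.bcast({"greeting": "hello from rank 0"}, root=0)
+```
+
+How the copies get started varies by machine; typically a launcher such as `mpirun -n <copies> python script.py` or the site's job scheduler starts them on the worker nodes.
+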
 Theta Machine Specs
-* Architecture: Intel-Cray XC40
 * Speed: 11.7 petaflops
-* Processor per node: 64-core, 1.3-GHz Intel Xeon Phi 7230
-* Nodes: 4,392
-* Cores: 281,088
-* Memory: 843 TB (192GB / node)
-* High-bandwidth memory: 70 TB (16GB / node)
-* Interconnect: Aries network with Dragonfly topology
-* Racks: 24
-
-## ThetaGPU
+* Each node has:
+  * one 64-core Intel Xeon Phi (7230) CPU running at 1.3 GHz
+  * 192 GB of DDR4 memory
+  * 16 GB of high-bandwidth memory
+* 4,392 total nodes installed in 24 racks
+* High-speed network: Aries network with Dragonfly topology
+
+## [ThetaGPU](https://www.alcf.anl.gov/alcf-resources/theta)
+
+Inside ThetaGPU you'll also see repetition, though NVIDIA placed decorative plates over the hardware so you only see their logo. Each plate covers one computer _node_.
+
+ ThetaGPU Racks | ThetaGPU Inside
+ --- | ---
+![ThetaGPU](img/thetagpu1.jpg) | ![ThetaGPU](img/thetagpu2.jpg)

 ThetaGPU is an NVIDIA DGX A100-based system. The DGX A100 comprises eight NVIDIA A100 GPUs that provide a total of 320 gigabytes of memory for training AI datasets, as well as high-speed NVIDIA Mellanox ConnectX-6 network interfaces.

 ThetaGPU Machine Specs
-* Architecture: NVIDIA DGX A100
 * Speed: 3.9 petaflops
-* Processors: AMD EPYC 7742
-* Nodes: 24
-* DDR4 Memory: 24 TB
-* GPU Memory: 7,680 GB
-* Racks: 7
+* Each node has:
+  * 8 NVIDIA (A100) GPUs, each with 40 GB of onboard memory
+  * 2 AMD EPYC (7742) CPUs
+  * 1 TB of DDR4 memory
+* 24 total nodes installed in 7 racks

+## [Polaris](https://www.alcf.anl.gov/polaris)

-# Cluster/HPC Computing Hardware Setup
+![Polaris](img/polaris.jpg)

-![Hardware](img/supercomputer_diagram.png)
+The inside of Polaris again shows the _nodes_ stacked up inside closet-sized racks.
+
+![Polaris-rack](img/polaris1.jpg)

-In large supercomputers, like Theta, you combine multiple computer processors (CPUs) and/or graphics processors (GPUs) into a single _node_. A _node_ is like your desktop computer. It has a CPU on which the local operating system runs. It has local memory for running software. It may have GPUs for doing intensive calculations. Each node has a high-speed network connection that allows it to communicate with other nodes and to a large shared filesystem.
+Polaris is an NVIDIA A100-based system.

-Large systems typically have _worker_ nodes and _login_ nodes. _login_ nodes are the nodes on which every user arrives when they login to the system. _login_ nodes should not be used for computation, but for compiling code, writing/editing code, and launching _jobs_ on the system. A _job_ is the application that will be launched on the _worker_ nodes of the supercomputer.
+Polaris Machine Specs
+* Speed: 44 petaflops
+* Each node has:
+  * 4 NVIDIA (A100) GPUs
+  * 1 AMD EPYC (Milan) CPU
+* ~560 total nodes
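+
+The spec lists above describe what each node contains. A quick way to check some of this yourself, from a node you are logged into, is a short Python script like the one below. This is only a rough sketch: it assumes a Linux node, and the GPU count assumes the PyTorch library happens to be installed (any GPU-aware library could be used instead).
+
+```python
+# Rough sketch: report what hardware this node appears to have.
+import os
+import socket
+
+print("node name :", socket.gethostname())
+print("CPU cores :", os.cpu_count())          # logical cores, not physical
+
+# Total memory, read from the standard Linux /proc interface.
+with open("/proc/meminfo") as f:
+    mem_kb = int(f.readline().split()[1])     # first line is MemTotal in kB
+print("memory    : %.0f GiB" % (mem_kb / 1024**2))
+
+try:
+    import torch                              # assumes PyTorch is installed
+    print("GPUs      :", torch.cuda.device_count())
+except ImportError:
+    print("GPUs      : unknown (no GPU library found)")
+```
+
+Keep in mind that a _login_ node will usually report different (and more modest) hardware than the _worker_ nodes where _jobs_ actually run.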