# Argonne Leadership Computing Facility (ALCF) Overview

[Overview of ALCF](img/ALCF-AITraining2021_dist.pdf)

# What is a Supercomputer?
Argonne hosts DOE supercomputers for use by research scientists in need of large computational resources. Supercomputers are composed of many computing _nodes_ (1 _node_ = 1 physical computer) that are connected by a high-speed communications network so that groups of nodes can share information quickly, effectively operating together as a larger computer.

[Video presentation of our System Introduction](https://www.alcf.anl.gov/support-center/training-assets/getting-started-theta)
# A Compute Node
If you look inside your desktop or laptop you'll find these parts:

![parts](img/computer-parts-diagram.png)

A computing node of a supercomputer is very similar. It has many of the same parts, but it is designed as a single unit that can be inserted into and removed from large closet-sized racks along with many others:

![blade](img/computer_blade.jpg)

In large supercomputers, multiple computer processors (CPUs) and/or graphics processors (GPUs) are combined into a single node. Each node has a CPU on which the local operating system runs, local memory for running software, and possibly GPUs for doing intensive calculations. Each node also has a high-speed network connection that allows it to communicate with other nodes and with a large shared filesystem.
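To make the node/network picture concrete, here is a minimal sketch of how a program spread across several nodes can report where each of its processes is running. It assumes Python with the `mpi4py` package and an MPI launcher, neither of which is specified in the original material:

```python
# hello_nodes.py -- illustrative only; assumes mpi4py is installed and the
# script is launched with an MPI launcher, e.g. `mpiexec -n 8 python hello_nodes.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD            # all processes in this run
rank = comm.Get_rank()           # this process's ID within the run
size = comm.Get_size()           # total number of processes
node = MPI.Get_processor_name()  # hostname of the node this process is on

print(f"Process {rank} of {size} is running on node {node}")
```

When the run spans more than one node, the reported hostnames differ; the high-speed interconnect is what lets these separate computers cooperate as one machine.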
# Cluster/HPC Computing Hardware Setup

![Hardware](img/supercomputer_diagram.png)

Large computer systems typically have _worker_ nodes and _login_ nodes. _Login_ nodes are the nodes on which every user arrives when they log in to the system. _Login_ nodes should not be used for computation, but for compiling code, writing/editing code, and launching _jobs_ on the system. A _job_ is the application that will be launched on the _worker_ nodes of the supercomputer.
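Once a _job_ is running on its _worker_ nodes, the scheduler typically tells it which nodes it was allocated. As a hedged sketch (assuming a PBS-style scheduler that exports `PBS_NODEFILE`, which the original text does not specify), you could inspect that allocation like this:

```python
# Illustrative only: PBS-style schedulers usually write the list of worker
# nodes assigned to a job into a file named by the PBS_NODEFILE variable.
import os

nodefile = os.environ.get("PBS_NODEFILE")
if nodefile:
    with open(nodefile) as f:
        nodes = sorted({line.strip() for line in f if line.strip()})
    print(f"This job was allocated {len(nodes)} worker node(s): {nodes}")
else:
    print("PBS_NODEFILE not set; you are probably on a login node, not inside a job.")
```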
# Supercomputers are Big!

These supercomputers occupy a lot of space in the ALCF data center. Here is our staff at the time (2019) in front of [Mira](https://en.wikipedia.org/wiki/Mira_(supercomputer)), an IBM supercomputer that debuted as the third-fastest supercomputer in the world in 2012.

![Staff-Mira](img/mira_staff.jpg)
# ALCF Computing System Overview

## [Theta](https://www.alcf.anl.gov/alcf-resources/theta)

![Theta](https://www.alcf.anl.gov/sites/default/files/styles/965x543/public/2019-10/09_ALCF-Theta_111016_rgb.jpg?itok=lcvZKE6k)

The decals are stickers attached to the front of the computer _racks_ (closet-sized). If you look inside a single rack you will see rows of computer _nodes_. Notice that there are repetitive rows inside the racks. These machines are designed so that individual _nodes_ can be replaced if needed.

![Theta-nodes](img/theta1.jpg)

Theta is an 11.7-petaflops supercomputer based on Intel processors and interconnect technology, an advanced memory architecture, and a Lustre-based parallel file system, all integrated by Cray's HPC software stack.
Theta Machine Specs
* Architecture: Intel-Cray XC40
* Speed: 11.7 petaflops
* Processor per node: 64-core, 1.3-GHz Intel Xeon Phi 7230
* Nodes: 4,392
* Cores: 281,088
* Memory: 843 TB (192 GB/node)
* High-bandwidth memory: 70 TB (16 GB/node)
* Interconnect: Aries network with Dragonfly topology
* Racks: 24
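The aggregate numbers in this list follow directly from the per-node figures; a quick back-of-the-envelope check (plain Python, included here only as illustration):

```python
# Sanity-check Theta's aggregate specs from its per-node figures.
nodes = 4392
cores_per_node = 64
ddr4_gb_per_node = 192
hbm_gb_per_node = 16

print(nodes * cores_per_node)           # 281,088 cores
print(nodes * ddr4_gb_per_node / 1000)  # ~843 TB of DDR4 memory
print(nodes * hbm_gb_per_node / 1000)   # ~70 TB of high-bandwidth memory
```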
## [ThetaGPU](https://www.alcf.anl.gov/alcf-resources/theta)

Inside ThetaGPU you'll also see repetition, though NVIDIA placed decorative plates over the hardware so you only see their logo. Each plate covers one computer _node_.

ThetaGPU Racks | ThetaGPU Inside
--- | ---
![ThetaGPU](img/thetagpu1.jpg) | ![ThetaGPU](img/thetagpu2.jpg)

ThetaGPU is an NVIDIA DGX A100-based system. Each DGX A100 node comprises eight NVIDIA A100 GPUs that provide a total of 320 gigabytes of GPU memory for training AI datasets, as well as high-speed NVIDIA Mellanox ConnectX-6 network interfaces.
ThetaGPU Machine Specs
* Architecture: NVIDIA DGX A100
* Speed: 3.9 petaflops
* Processors: AMD EPYC 7742
* Nodes: 24
* DDR4 Memory: 24 TB
* GPU Memory: 7,680 GB
* Racks: 7
* Each node has:
  * 8 NVIDIA A100 GPUs, each with 40 GB of onboard memory
  * 2 AMD EPYC 7742 CPUs
  * 1 TB of DDR4 memory
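Because this series focuses on AI training, a natural first check on a node like this is what your deep-learning framework can actually see. A minimal sketch, assuming PyTorch with CUDA support is available (the original material does not prescribe a framework):

```python
# Illustrative only: list the GPUs visible on the current node.
# On a ThetaGPU (DGX A100) node this should report 8 devices with ~40 GB each.
import torch

count = torch.cuda.device_count()
print("GPUs visible:", count)
for i in range(count):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")
```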
## [Polaris](https://www.alcf.anl.gov/polaris)
![Polaris](img/polaris.jpg)
The inside of Polaris again shows the _nodes_ stacked up inside a closet-sized rack.

![Polaris-rack](img/polaris1.jpg)
Polaris is an NVIDIA A100-based system.
Polaris Machine Specs
* Speed: 44 petaflops
* Each node has:
  * 4 NVIDIA A100 GPUs
  * 1 AMD EPYC (Milan) CPU
* ~560 total nodes