updating introductory material
jtchilders committed Aug 9, 2022
1 parent bb07757 commit 9b8e595
Showing 1 changed file with 65 additions and 25 deletions.
90 changes: 65 additions & 25 deletions 00_introToAlcf/00_computeSystems.md
@@ -1,43 +1,83 @@
# Argonne Leadership Computing Facility (ALCF) Overview
[Overview of ALCF](img/ALCF-AITraining2021_dist.pdf)
# What is a Supercomputer?

# Computing System Overview
Argonne hosts DOE supercomputers for use by research scientists in need of large computational resources. Supercomputers are composed of many computing _nodes_ (1 _node_ = 1 physical computer) that are connected by a high-speed communications network so that groups of nodes can share information quickly, effectively operating together as a larger computer.
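
As a rough illustration of nodes "operating together as a larger computer", the sketch below splits one sum across several simulated nodes, lets each compute a partial result, and then combines the partial results. It runs on a single machine in plain Python. The names `split_work` and `node_partial_sum` are made up for this example, and on a real supercomputer the partial results would travel over the network (typically through a library such as MPI) rather than through a local list.

```python
def split_work(n_items, n_nodes):
    """Return one (start, stop) index range per node, as evenly sized as possible."""
    base, extra = divmod(n_items, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional item each.
        stop = start + base + (1 if node < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def node_partial_sum(data, start, stop):
    """Work done by a single (simulated) node: sum its own slice of the data."""
    return sum(data[start:stop])


data = list(range(1, 101))          # the numbers 1..100
ranges = split_work(len(data), 4)   # pretend we have 4 nodes
partials = [node_partial_sum(data, a, b) for a, b in ranges]
total = sum(partials)               # stand-in for the communication step

print(ranges)
print(partials, total)
```

Each simulated node only touches its own slice of the data, which is the property that lets real systems scale to thousands of nodes.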

[Video presentation of our System Introduction](https://www.alcf.anl.gov/support-center/training-assets/getting-started-theta)
# A Compute Node

If you look inside your Desktop or Laptop you'll find these parts:

![parts](img/computer-parts-diagram.png)

A computing node of a supercomputer is very similar. Each has similar parts, but it is designed as a single unit that can be inserted into and removed from large closet-sized racks alongside many others:

![blade](img/computer_blade.jpg)

In large supercomputers, multiple computer processors (CPUs) and/or graphics processors (GPUs) are combined into a single node. Each node has a CPU on which the local operating system runs and local memory for running software, and it may have GPUs for intensive calculations. Each node also has a high-speed network connection that allows it to communicate with other nodes and with a large shared filesystem.

# Cluster/HPC Computing Hardware Setup

![Hardware](img/supercomputer_diagram.png)

Large computer systems typically have _worker_ nodes and _login_ nodes. _Login_ nodes are where every user arrives when they log in to the system. They should not be used for computation, but for compiling code, writing and editing code, and launching _jobs_ on the system. A _job_ is the application that will be launched on the _worker_ nodes of the supercomputer.

# Supercomputers are Big!

These supercomputers occupy a lot of space in the ALCF data center. Here is our staff at the time (2019) in front of [Mira](https://en.wikipedia.org/wiki/Mira_(supercomputer)), an IBM supercomputer that debuted as the third-fastest supercomputer in the world in 2012.

![Staff-Mira](img/mira_staff.jpg)


# ALCF Computing System Overview

## [Theta](https://www.alcf.anl.gov/alcf-resources/theta)
![Theta](https://www.alcf.anl.gov/sites/default/files/styles/965x543/public/2019-10/09_ALCF-Theta_111016_rgb.jpg?itok=lcvZKE6k)

The decals are stickers attached to the front of the closet-sized computer _racks_. If you look inside a single rack, you will see repeating rows of computer _nodes_. These machines are built so that individual _nodes_ can be replaced if needed.

![Theta-nodes](img/theta1.jpg)

Theta is an 11.7-petaflops supercomputer based on Intel processors and interconnect technology, an advanced memory architecture, and a Lustre-based parallel file system, all integrated by Cray’s HPC software stack.

Theta Machine Specs
* Architecture: Intel-Cray XC40
* Speed: 11.7 petaflops
* Processor per node: 64-core, 1.3-GHz Intel Xeon Phi 7230
* Nodes: 4,392
* Cores: 281,088
* Memory: 843 TB (192GB / node)
* High-bandwidth memory: 70 TB (16GB / node)
* Interconnect: Aries network with Dragonfly topology
* Racks: 24

* Each node has:
  * one 64-core Intel Xeon Phi (7230) CPU running at 1.3 GHz
  * 192 GB of DDR4 memory
  * 16 GB of high-bandwidth memory
* 4,392 total nodes installed in 24 racks
* High-speed network tech: Aries network with Dragonfly topology

## [ThetaGPU](https://www.alcf.anl.gov/alcf-resources/theta)

Inside ThetaGPU, you'll also see repetition, though NVIDIA placed these fancy plates over the hardware, so you only see their logo. Each plate covers one computer _node_.

ThetaGPU Racks | ThetaGPU Inside
--- | ---
![ThetaGPU](img/thetagpu1.jpg) | ![ThetaGPU](img/thetagpu2.jpg)

ThetaGPU is an NVIDIA DGX A100-based system. The DGX A100 comprises eight NVIDIA A100 GPUs that provide a total of 320 gigabytes of memory for training AI datasets, as well as high-speed NVIDIA Mellanox ConnectX-6 network interfaces.

ThetaGPU Machine Specs
* Architecture: NVIDIA DGX A100
* Speed: 3.9 petaflops
* Processors: AMD EPYC 7742
* Nodes: 24
* DDR4 Memory: 24 TB
* GPU Memory: 7,680 GB
* Racks: 7
* Each node has:
  * 8 NVIDIA (A100) GPUs, each with 40 GB of onboard memory
  * 2 AMD EPYC (7742) CPUs
  * 1 TB of DDR4 memory
* 24 total nodes installed in 7 racks
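
The ThetaGPU totals can be rebuilt from the per-node description in the same way. The sketch below is only a worked example: `NodeSpec` is a hypothetical container defined here for illustration, not part of any ALCF tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hardware in one node (illustrative container, not an ALCF API)."""
    gpus: int
    gpu_memory_gb: int   # memory on each GPU
    cpus: int
    ddr4_tb: int         # host memory per node


# Values taken from the ThetaGPU node description above.
thetagpu_node = NodeSpec(gpus=8, gpu_memory_gb=40, cpus=2, ddr4_tb=1)
n_nodes = 24

gpu_memory_per_node_gb = thetagpu_node.gpus * thetagpu_node.gpu_memory_gb
total_gpus = n_nodes * thetagpu_node.gpus
total_gpu_memory_gb = n_nodes * gpu_memory_per_node_gb
total_ddr4_tb = n_nodes * thetagpu_node.ddr4_tb

print(f"GPU memory per node: {gpu_memory_per_node_gb} GB")   # 320 GB
print(f"GPUs in the system:  {total_gpus}")                  # 192
print(f"Total GPU memory:    {total_gpu_memory_gb:,} GB")    # 7,680 GB
print(f"Total DDR4 memory:   {total_ddr4_tb} TB")            # 24 TB
```

The 320 GB per node and 7,680 GB system total match the figures quoted for the DGX A100 and in the machine specs.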

## [Polaris](https://www.alcf.anl.gov/polaris)

![Polaris](img/polaris.jpg)

The inside of Polaris again shows the _nodes_ stacked up in a closet.

![Polaris-rack](img/polaris1.jpg)

Polaris is an NVIDIA A100-based system.

Polaris Machine Specs
* Speed: 44 petaflops
* Each node has:
  * 4 NVIDIA (A100) GPUs
  * 1 AMD EPYC (Milan) CPU
* ~560 total nodes
