Initial draft of 20-architecture episode #48
base: gh-pages
Changes from all commits
59f6a4e
70b726a
c9e5715
85eb51c
6ceb1af
72ad59c
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,34 @@ | ||
--- | ||
title: "HPC architecture" | ||
questions: | ||
- "To be filled with a question!" | ||
- "What is a high-performance computer?" | ||
- "How are high-performance computers different from personal computers?" | ||
- "How do these differences influence how I use HPC systems most effectively?" | ||
objectives: | ||
- "Provide a general mind map of how HPC resources are connected together (head node, compute node, NAS)" | ||
keypoints: | ||
- "To be filled with a fact/sentence." | ||
- "A high-performance computer system provides a larger compute capability than is possible to package in a personal computer." | ||
- "HPC systems are typically an aggregation of a bunch of computers, each one of which can look pretty similar to your personal computer." | ||
- "HPC systems are usually accessed remotely, over the network." | ||
- "HPC systems are usually shared among many users. Each user typically gets a dedicated portion of the computer's resources for a period of time." | ||
- "Special measures have to be taken to provide a file system that can keep up with an HPC system." | ||
- "HPC systems often provide a lot of different software packages, and provide ways of selecting and configuring them to get the environment you need." | ||
--- | ||
|
||
# What is a High-Performance Computer? | ||
|
||
A high-performance computer (HPC system) is a tool used by computational scientists and engineers to tackle problems that require more computing resources or time than they can obtain on the personal computers available to them. HPC systems range in size from the equivalent of just a few personal computers to tens or even hundreds of thousands of them. They tend to be expensive to buy and operate, so they are often shared at the departmental or institutional level. There are also many regional and national HPC centers. Because these systems are shared and often housed far from their users, most of them are accessed remotely, over the network. | ||
|
||
HPC systems are generally constructed from many individual computers, each similar in capability to a personal computer. Each of these individual computers is often referred to as a **node**. HPC systems often include several different types of nodes, which are specialized for different purposes. **Head** (or **front-end** or **login**) nodes are where you log in to interact with the computer. **Compute** nodes are where the real computing is done. **Storage** nodes provide the specialized filesystems used on HPC systems. Some HPC systems also have **service** nodes, which you don't usually interact with directly, but you will sometimes read about. These nodes are connected by a network (or interconnect), which is often designed to provide very high performance as well. | ||
|
||
<!-- It would be nice to have a diagram that showed the different types of nodes, and the network. | ||
Something like http://www.archer.ac.uk/training/course-material/2018/03/intro-hw/slides/L01_WhyHPC.pdf | ||
--> | ||
|
||
Depending on the HPC system, the compute nodes, even individually, might be much more powerful than a typical personal computer. They often have multiple processors (each with many cores), and may have accelerators (such as GPUs) and other capabilities less common on personal computers. | ||
but what makes things go fast is usually quantity not quality. I think that should be clear.
Both can be important. To the extent that the nodes are much more powerful than a personal computer, users can do much more with them -- and need to in order to use the machine effectively. Example: the Summit system now being stood up at Oak Ridge, each node has 122 CPU cores and 6 GPUs, as well as two large flash drives and 608 GB of memory. If you expect to run there only what you run on your laptop, it is a complete waste.
good point! |
||
|
||
In order to share these large systems among many users, it is common to allocate subsets of the compute nodes to tasks (or **jobs**), based on requests from users. These jobs may take a long time to complete, so they come and go in time. To manage the sharing of the compute nodes among all of the jobs, HPC systems use a **batch system** or **scheduler**. The batch system usually has commands for submitting jobs, inquiring about their status, and modifying them. The HPC center defines the algorithms by which jobs are prioritized for execution on the compute nodes, while ensuring that the compute nodes are not overloaded. <!-- reference to episode 30 --> | ||
|
||
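To make the scheduling idea above concrete, here is a minimal sketch of a first-come, first-served scheduler written in Python. It is a toy model, not the algorithm of any real batch system: the names `Job` and `simulate_fifo` are hypothetical, time is counted in abstract steps, and real schedulers also weigh priorities, fair-share policies, and backfilling.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int     # number of compute nodes requested
    duration: int  # run time in abstract time steps (assumed positive)


def simulate_fifo(jobs, total_nodes):
    """Start jobs strictly in submission order; return {job name: start time}."""
    queue = deque(jobs)
    running = []        # (end_time, nodes) for every job currently running
    free = total_nodes  # compute nodes not allocated to any job
    clock = 0
    starts = {}
    while queue:
        job = queue[0]
        if job.nodes > total_nodes:
            raise ValueError(f"{job.name} requests more nodes than the system has")
        if job.nodes <= free:
            # Enough free nodes: allocate them and start the job now.
            queue.popleft()
            free -= job.nodes
            running.append((clock + job.duration, job.nodes))
            starts[job.name] = clock
        else:
            # Not enough free nodes: wait until the next running job finishes
            # and reclaim its nodes.
            running.sort()
            end_time, nodes = running.pop(0)
            clock = end_time
            free += nodes
    return starts


# Three jobs submitted to a hypothetical 4-node system.
jobs = [Job("a", 2, 10), Job("b", 2, 5), Job("c", 3, 4)]
start_times = simulate_fifo(jobs, total_nodes=4)
```

In this example jobs `a` and `b` start immediately and fill the machine, while `c` has to wait until enough nodes are released. The scheduler never hands out more nodes than exist, which is the "not overloaded" guarantee described above.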
The kind of computing that people do on HPC systems often involves very large files, and/or many of them. Further, the files have to be accessible from all of the front-end and compute nodes on the system. So most HPC systems have specialized filesystems that are designed to do a better job of meeting these needs than typical network filesystems, like NFS. Frequently, these specialized filesystems are intended to be used only for short- or medium-term storage, not permanent storage. So HPC systems often have several different filesystems available -- for example **home**, and **scratch** filesystems. It can be very important to select the right filesystem to get the results you want (performance or permanence are the typical trade-offs). <!-- reference to episode 35 --> | ||
+1 |
||
|
||
Because HPC systems serve many users with different software needs, they often have multiple versions of commonly used software packages installed. Since you can't easily install and use different versions of a package at the same time, HPC systems often use an approach called **modules**, which allows you to configure your software environment with the particular versions of software that you need. <!-- reference to episode 40 --> |
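One way to picture what loading a module does is as an edit to environment variables such as `PATH`. The Python sketch below illustrates only that idea and is not how any particular module system is implemented. The function `load_module` and the install layout `/opt/apps/<name>/<version>/bin` are assumptions made for the example, and real module systems manage many more variables.

```python
def load_module(env, name, version, prefix="/opt/apps"):
    """Return a copy of env with the chosen package version first on PATH.

    Assumes a hypothetical layout of <prefix>/<name>/<version>/bin and a
    POSIX-style, colon-separated PATH.
    """
    new_env = dict(env)
    bin_dir = f"{prefix}/{name}/{version}/bin"
    # Drop any other version of the same package so that only one is active.
    kept = [
        path
        for path in env.get("PATH", "").split(":")
        if path and not path.startswith(f"{prefix}/{name}/")
    ]
    new_env["PATH"] = ":".join([bin_dir] + kept)
    return new_env


# Switch a hypothetical environment from one compiler version to another.
before = {"PATH": "/opt/apps/gcc/7.3/bin:/usr/bin"}
after = load_module(before, "gcc", "9.1")
```

The original environment is left untouched and a new one is returned, which mirrors how two users can each select a different version of the same package without affecting one another.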
I don't think we need to introduce service/storage nodes. Too many terms!
I debated with myself about service nodes. Storage, I'm not yet persuaded. I can see an argument of focusing on the filesystems and not worrying about the hardware behind that. My feeling was that storage nodes might be something they would see in descriptions of machines at some facilities. I'm willing to be argued out of this feeling.
I kinda agree with @ChristinaLK...from the standpoint of HPC software use and user experience, I think the only kinds of nodes users will routinely wind up dealing with in conversation, or instructions or in reading documentation would be the compute (back-end) nodes and the login (front-end) nodes. System architectural documents are probably the only places where other kinds of nodes (storage, service, gateway, etc.) might wind up getting described and so are probably outside the scope of a novice HPC lesson.