Initial draft of 20-architecture episode #48
base: gh-pages
Conversation
A high-performance computing (HPC) system is a tool used by computational scientists and engineers to tackle problems that require more computing resources or time than they can obtain on the personal computers available to them. HPC systems range in size from the equivalent of just a few personal computers to tens, or even hundreds of thousands, of them. They tend to be expensive to buy and operate, so they are often shared at the departmental or institutional level; there are also many regional and national HPC centers. Because of this, most HPC systems are accessed remotely, over the network.
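Since access is almost always remote, the first step on most systems is an SSH connection from your own machine. A minimal sketch; the username and hostname below are placeholders, and your HPC center will provide the real values:

```bash
# Connect to the login node of a (hypothetical) HPC system.
# Your HPC center supplies the actual hostname and your username.
ssh yourUsername@cluster.hpc-center.example
```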
HPC systems are generally constructed from many individual computers, similar in capability to personal computers. Each of these individual computers is often referred to as a **node**. HPC systems often include several different types of nodes, specialized for different purposes. **Head** (or **front-end** or **login**) nodes are where you log in to interact with the system. **Compute** nodes are where the real computing is done. **Storage** nodes provide the specialized filesystems used on HPC systems. Some HPC systems also have **service** nodes, which you don't usually interact with directly but will sometimes read about. These nodes are connected by a network (or **interconnect**), which is often designed for very high performance as well.
I don't think we need to introduce service/storage nodes. Too many terms!
I debated with myself about service nodes. On storage, I'm not yet persuaded. I can see an argument for focusing on the filesystems and not worrying about the hardware behind them. My feeling was that storage nodes might be something they would see in descriptions of machines at some facilities. I'm willing to be argued out of this feeling.
I kinda agree with @ChristinaLK... From the standpoint of HPC software use and user experience, the only kinds of nodes users will routinely deal with in conversation, in instructions, or in documentation are the compute (back-end) nodes and the login (front-end) nodes. System architecture documents are probably the only places where other kinds of nodes (storage, service, gateway, etc.) get described, so they are probably outside the scope of a novice HPC lesson.
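One concrete way to show the login/compute distinction to learners (a sketch, assuming a Slurm scheduler; other batch systems use different commands):

```bash
# Run directly on the login node: prints the login node's hostname.
hostname

# Run through the scheduler: the same command executes on a compute
# node and prints that node's hostname instead (Slurm's srun shown).
srun hostname
```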
<!-- Something like http://www.archer.ac.uk/training/course-material/2018/03/intro-hw/slides/L01_WhyHPC.pdf -->
Depending on the HPC system, the compute nodes, even individually, might be much more powerful than a typical personal computer. They often have multiple processors (each with many cores), and may have accelerators (such as GPUs) and other capabilities less common on personal computers.
But what makes things go fast is usually quantity, not quality. I think that should be clear.
Both can be important. To the extent that the nodes are much more powerful than a personal computer, users can do much more with them -- and need to in order to use the machine effectively. Example: on the Summit system now being stood up at Oak Ridge, each node has 44 CPU cores and 6 GPUs, as well as two large flash drives and 608 GB of memory. If you expect to run there only what you run on your laptop, it is a complete waste.
good point!
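A few standard Linux commands make the comparison with a personal computer concrete. A sketch; `nvidia-smi` is present only on nodes with NVIDIA GPUs and drivers installed:

```bash
# Number of CPU cores available on this node.
nproc

# Total and free memory, in human-readable units.
free -h

# Attached NVIDIA GPUs, if any (absent on nodes without GPUs).
nvidia-smi
```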
In order to share these large systems among many users, it is common to allocate subsets of the compute nodes to tasks (or **jobs**), based on requests from users. Jobs may take a long time to complete, so the mix of running jobs changes continuously. To manage the sharing of the compute nodes among all of the jobs, HPC systems use a **batch system** or **scheduler**. The batch system usually provides commands for submitting jobs, inquiring about their status, and modifying them. The HPC center defines the policies by which jobs are prioritized for execution, and the scheduler ensures that the compute nodes are not overloaded. <!-- reference to episode 30 -->
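As a sketch of the typical interaction, assuming a Slurm scheduler (the job script name and job ID below are placeholders; PBS, LSF, and other schedulers provide equivalent commands under different names):

```bash
# Submit a job script to the scheduler; it replies with a job ID.
sbatch my_job.sh

# Check the status of your own queued and running jobs.
squeue -u $USER

# Cancel a job, referring to it by ID, e.g. to modify and resubmit it.
scancel 12345
```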
The kind of computing that people do on HPC systems often involves very large files, many files, or both. Further, the files have to be accessible from all of the front-end and compute nodes on the system. So most HPC systems have specialized filesystems designed to meet these needs better than typical network filesystems, such as NFS. Frequently, these specialized filesystems are intended only for short- or medium-term storage, not permanent storage. So HPC systems often have several different filesystems available -- for example, **home** and **scratch** filesystems. It can be very important to select the right filesystem for the results you want (performance versus permanence is the typical trade-off). <!-- reference to episode 35 -->
+1
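A sketch of exploring where data lives, assuming the common convention that a variable such as `$SCRATCH` points at the fast, non-permanent filesystem (the variable name, paths, and filename below vary by center and are placeholders):

```bash
# Show which filesystem holds your home directory, and how full it is.
df -h "$HOME"

# Many centers set a variable like $SCRATCH for the scratch filesystem.
echo "$SCRATCH"

# Work in scratch for performance, then copy results you need to keep
# back to a permanent location.
cp "$SCRATCH/results.dat" "$HOME/project/"
```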
Initial draft developed at CarpentryCon2018 by @Ianvdl, @sleak-lbl, and @bernhold. Use it in good health!