
Issue: EuroCC2 Bootcamp Technical Issues #2

Open
programmah opened this issue May 10, 2024 · 0 comments

  1. Lab: single-gpu overview
     - Output from the code differs from the example in the lab (it looks like it may be off by one iteration; does the lab example include iteration 0?). For iteration 900, the code outputs 900, 0.173963, while the example in the lab is 900, 0.173818.
  2. Lab: intra-node topology
     In the DGX A100 section, it has the following text:
     “If we remove the -p2p flag and and run the command again for GPUs 0 and 7, we will not get any difference in performance on DGX A100 system. As you may recall, P2P is not possible between GPUs 0 and 7, so the underlying communication path doesn't change, resulting in same performance with and without the -p2p flag. This can be confirmed by profiling the application and looking at the operations performed in the Nsight Systems timeline.”
     Two things here:
     - Doubled word: “and and” in the first sentence.
     - Secondly (and the important one): it says that P2P is not possible between GPUs 0 and 7. This is incorrect for the DGX A100 (it is true for a DGX V100); thanks to the NVSwitch, P2P is available between any pair of GPUs on the DGX A100.
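This is easy to verify on a given machine by querying the CUDA runtime directly. A minimal sketch (assuming at least 8 visible GPUs; this is illustrative code, not from the lab):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Query whether device 0 can access device 7's memory directly (P2P).
    // On a DGX A100 (NVSwitch) this reports 1 for every GPU pair;
    // on a DGX V100, some pairs report 0.
    int can_access = 0;
    cudaError_t err = cudaDeviceCanAccessPeer(&can_access,
                                              /*device=*/0, /*peerDevice=*/7);
    if (err != cudaSuccess) {
        std::printf("error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("P2P 0 -> 7: %s\n", can_access ? "available" : "not available");
    return 0;
}
```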

  3. Lab: CUDA streams
     - The diagram of the default stream is potentially misleading: it suggests the non-default stream can execute at the same time.
     - For the Optimization section: "Notice that the copy operations take place serially after the Jacobi iteration. The kernel computation must be complete before copying the updated halos from the GPU of interest (source) to its neighbours (destination). However, we can perform the copy operation from the neighbouring GPUs (source) to the GPU of interest (destination) concurrently with the kernel computation as it will only be required in the next iteration." A diagram for this might be helpful.
     - Also, for Implementation exercise part 4, a diagram tracking the sequence of events we are trying to create might be useful.
     - There appears to be an error in the provided code to be changed: the final TODO has a cudaMemcpyAsync as the code to be modified, but it should actually be a cudaEventRecord (the solution has the correct code):

           // TODO: Part 4- Record completion of bottom halo copy from "dev_id" to its neighbour
           // to be used in next iteration. Record the event for "push_bottom_done" stream of
           // "dev_id" for next iteration which is "(iter+1) % 2"
           CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float),
                                        /*Fill me*/, /*Fill me*/));

       It should be:

           CUDA_RT_CALL(cudaEventRecord(/*Fill me*/, /*Fill me*/));

       with the solution:

           CUDA_RT_CALL(cudaEventRecord(push_bottom_done[((iter + 1) % 2)][dev_id],
                                        push_bottom_stream[dev_id]));
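     For context, the intent of this TODO is the standard record-then-wait pattern: the copy stream records an event once the halo copy is enqueued, and whichever stream consumes the halo in the next iteration waits on it. A hedged sketch of the surrounding fragment (push_bottom_done and push_bottom_stream are taken from the lab code above; compute_stream is an assumed name for the stream running the Jacobi kernel, and the loop body is elided):

```cuda
// Inside the iteration loop, after enqueueing the bottom-halo copy on
// push_bottom_stream[dev_id]:

// Record completion of the copy. The event slot is indexed by the *next*
// iteration's parity, (iter + 1) % 2, because that is when it is consumed.
CUDA_RT_CALL(cudaEventRecord(push_bottom_done[(iter + 1) % 2][dev_id],
                             push_bottom_stream[dev_id]));

// ...then, at the top of the next iteration, before launching the Jacobi
// kernel, the compute stream waits on the event recorded last iteration:
CUDA_RT_CALL(cudaStreamWaitEvent(compute_stream[dev_id],
                                 push_bottom_done[iter % 2][dev_id], 0));
```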

  4. Lab: Multi-node Multi-GPU programming
     - The srun command seemed to take noticeably longer than in the other labs.

  5. Lab: MPI with cuda memcpy

This and subsequent labs with MPI show a warning message every time some MPI code is executed, often printed several times (I presume once per MPI task):

"Sorry! You were supposed to get help about …"
This can be fixed by setting the environment variables as follows:
export OPAL_PREFIX=$MPI_HOME
export PMIX_MCA_psec=^munge
 
Section "Point-to-point communication"
- Typo: "differenciate" should be "differentiate"

Code: jacobi_memcpy_mpi.cpp
- Typo: the first TODO in part 1 has "PI_CALL"; it should be "MPI_CALL"
- Typo: the first TODO in part 1 has "ot" where it should be "to" in a comment

Section "OpenMPI Process Mappings"
- Typo: "spcified" should be "specified" (same in the solution version of the code)
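For readers unfamiliar with this lab's approach, "MPI with cuda memcpy" stages device halos through host buffers before exchanging them with MPI. A minimal, hedged sketch of that pattern (illustrative names and sizes, not the lab's actual code):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nx = 1024;            // halo row width (illustrative)
    float* d_halo = nullptr;        // device-resident halo row
    cudaMalloc(&d_halo, nx * sizeof(float));
    std::vector<float> h_send(nx), h_recv(nx);

    // Stage the device halo through the host, exchange it with the
    // neighbouring ranks, then copy the received halo back to the device.
    cudaMemcpy(h_send.data(), d_halo, nx * sizeof(float), cudaMemcpyDeviceToHost);
    int up   = (rank + 1) % size;
    int down = (rank - 1 + size) % size;
    MPI_Sendrecv(h_send.data(), nx, MPI_FLOAT, up,   0,
                 h_recv.data(), nx, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_halo, h_recv.data(), nx * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_halo);
    MPI_Finalize();
    return 0;
}
```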

  6. Lab: NCCL
     - No “Lab objectives” section.
     - Section "Implementation Exercise":
       - Typo: "funciton" should be "function"
       - Typo: "Similarily" should be "Similarly"
       - The synchronize-device TODO is not mentioned in the list.
     - Towards the end of the notebook:
       - Typo: "number of rocesses" should be "number of processes"
     - A lot of info gets printed out as part of the execution - presumably from the NCCL_DEBUG=INFO environment variable? If so, maybe mention this in the text. E.g.:
    NCCL version 2.18.5+cuda12.2
    dgx01:3992189:3992189 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/RoCE [6]mlx5_6:1/IB [7]mlx5_7:1/IB [8]mlx5_8:1/IB [9]mlx5_9:1/IB [10]mlx5_10:1/IB [RO]; OOB ibp12s0:100.126.5.1<0>
    dgx01:3992189:3992189 [0] NCCL INFO Using network IB
    dgx01:3992189:3992189 [0] NCCL INFO comm 0x2be48c0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 commId 0x25ba213461d9c320 - Init START
    dgx01:3992189:3992189 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
    dgx01:3992189:3992189 [0] NCCL INFO Channel 00/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 01/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 02/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 03/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 04/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 05/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 06/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 07/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 08/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 09/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 10/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 11/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 12/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 13/24 :    0   1
    dgx01:3992189:3992189 [0] NCCL INFO Channel 14/24 :    0   1
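     If the verbosity is indeed coming from NCCL's debug setting, it is controlled by the standard NCCL_DEBUG environment variable (an NCCL setting, not something specific to this lab), which the text could point out:

```shell
# Verbose initialisation/topology logging (likely what produced the output above):
export NCCL_DEBUG=INFO

# Quieter runs: only warnings...
export NCCL_DEBUG=WARN

# ...or disable NCCL logging entirely:
unset NCCL_DEBUG
```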

  7. Lab: NVSHMEM
     - No “Lab objectives” section.
     - In section "Communication Model", the paragraph is a word-for-word copy of the paragraph used in the previous section, "GPU-initiated communication".
     - Section "Memory model" mentions "NVSHMEMAPI" - should it be two words, i.e. "NVSHMEM API"?
     - Section "Thread-group level communication": the code example calls "get_block_offet" - "offet" should be "offset".
     - In section "Implementation exercise": “Alternatively, you can navigate to CFD/English/C/source_code/mpi/ directory in Jupyter's file browser in the left pane. Then, click to open the jacobi_nvshmem.cu file.” - the folder is wrong; it should be /nvshmem, not /mpi.
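     For reference, the overall shape of an NVSHMEM program, which might help make the "Communication Model" discussion concrete (a hedged sketch using the public NVSHMEM host API; not the lab's code):

```cuda
#include <cstdio>
#include <nvshmem.h>

int main() {
    nvshmem_init();                  // one PE (process) per GPU
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    // Symmetric allocation: every PE allocates a buffer of the same size,
    // which remote PEs can then target with put/get operations.
    int* sym = static_cast<int*>(nvshmem_malloc(sizeof(int)));

    std::printf("PE %d of %d\n", mype, npes);

    nvshmem_free(sym);
    nvshmem_finalize();
    return 0;
}
```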

@programmah programmah changed the title Issue: EuroCC Bootcamp Technical Issues Issue: EuroCC2 Bootcamp Technical Issues May 10, 2024