diff --git a/01-access-machines.md b/01-access-machines.md
index 92a9439..02bbdc8 100644
--- a/01-access-machines.md
+++ b/01-access-machines.md
@@ -1,22 +1,20 @@
---
-author: Alexandre Strube // Sabrina Benassou
-title: Accessing the machines, intro
+author: Alexandre Strube // Sabrina Benassou // Javad Kasravi
+title: Bringing Deep Learning Workloads to JSC supercomputers
# subtitle: A primer in supercomputers
-date: June 25, 2024
+date: November 19, 2024
---

## Communication:

Links for the complementary parts of this course:

-- [Zoom](https://go.fzj.de/bringing-dl-workloads-to-jsc-zoom)
-- [Slack](https://go.fzj.de/bringing-dl-workloads-to-jsc-slack)
-- [JSC Training Page](https://go.fzj.de/bringing-dl-workloads-to-jsc-course)
-- [Judoor project page invite](https://go.fzj.de/bringing-dl-workloads-to-jsc-project-join)
-- [This document: https://go.fzj.de/bringing-dl-workloads-to-jsc](https://go.fzj.de/bringing-dl-workloads-to-jsc)
+- [Event page](https://go.fzj.de/dl-in-neuroscience-course)
+- [Judoor project page invite](https://go.fzj.de/dl-in-neuroscience-project-join)
+- [This document: https://go.fzj.de/dl-in-neuroscience](https://go.fzj.de/dl-in-neuroscience)
- Our mailing list for [AI news](https://lists.fz-juelich.de/mailman/listinfo/ml)
-- [Survey at the end of the course](https://go.fzj.de/bringing-dl-workloads-to-jsc-survey)
+- [Survey at the end of the course](https://go.fzj.de/dl-in-neuroscience-survey)
- [Virtual Environment template](https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template)
-- [SOURCE of the course/slides on Github](https://go.fzj.de/bringing-dl-workloads-to-jsc-repo)
+- [SOURCE of the course/slides on Github](https://go.fzj.de/dl-in-neuroscience-repo)

![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg)

@@ -44,34 +42,38 @@ Links for the complimentary parts of this course:
:::: {.col}
![Sabrina Benassou](pics/sabrina.jpg)
::::
+:::: {.col}
+![Javad Kasravi](pics/javad.jpg)
+::::
+
:::
![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg)

---

-### Schedule for day 1
+### Schedule

| Time | Title |
| ------------- | ----------- |
-| 10:00 - 10:15 | Welcome |
-| 10:15 - 11:00 | Introduction |
-| 11:00 - 11:15 | Coffee break |
-| 11:16 - 11:30 | Judoor, Keys |
-| 11:30 - 12:00 | SSH, Jupyter, VS Code |
+| 09:00 - 09:15 | Welcome |
+| 09:15 - 10:00 | Introduction |
+| 10:00 - 10:15 | Coffee break |
+| 10:15 - 10:30 | Judoor, Keys |
+| 10:30 - 11:00 | Jupyter-JSC |
+| 11:00 - 11:15 | Coffee Break |
+| 11:15 - 12:00 | Running services on the login and compute nodes |
| 12:00 - 12:15 | Coffee Break |
-| 12:15 - 13:00 | Running services on the login and compute nodes |
-| 13:00 - 13:15 | Coffee Break |
-| 13:30 - 14:00 | Sync (everyone should be at the same point) |
+| 12:30 - 13:00 | Sync (everyone should be at the same point) |

---

### Note

Please open this document on your own browser! We will need it for the exercises.

-[https://go.fzj.de/bringing-dl-workloads-to-jsc](https://go.fzj.de/bringing-dl-workloads-to-jsc)
+[https://go.fzj.de/dl-in-neuroscience](https://go.fzj.de/dl-in-neuroscience)

-![Mobile friendly, but you need it on your computer, really](images/bringing-dl-workloads-to-jsc.png)
+![Mobile friendly, but you need it on your computer, really](images/dl-in-neuroscience.png)

---

@@ -228,12 +230,12 @@ Please open this document on your own browser! We will need it for the exercises

### Connecting to Jureca DC
#### Getting compute time
-- Go to [https://go.fzj.de/bringing-dl-workloads-to-jsc-project-join](https://go.fzj.de/bringing-dl-workloads-to-jsc-project-join)
-- Join the course project `training2425`
+- Go to [https://go.fzj.de/dl-in-neuroscience-project-join](https://go.fzj.de/dl-in-neuroscience-project-join)
+- Join the course project `training2441`
- Sign the Usage Agreements ([Video](https://drive.google.com/file/d/1mEN1GmWyGFp75uMIi4d6Tpek2NC_X8eY/view))
- Compute time allocation is based on compute projects. 
For every compute job, a compute project pays.
-- Time is measured in core-hours. One hour of Jureca DC is 48 core-hours.
-- Example: Job runs for 8 hours on 64 nodes of Jureca DC: 8 * 64 * 48 = 24576 core-h!
+- Time is measured in core-hours. One hour on one Jureca DC node (128 cores) costs 128 core-hours.
+- Example: Job runs for 8 hours on 64 nodes of Jureca DC: 8 * 64 * 128 = 65536 core-h!

---

@@ -250,277 +252,32 @@ Please open this document on your own browser! We will need it for the exercises

## Jupyter

-#### Pay attention to the partition - DON'T RUN IT ON THE LOGIN NODE!!!

![](images/jupyter-partition.png)

---

-## Connecting to Jureca DC
-
----
-
-## VSCode
-
-- [Download VScode: code.visualstudio.com](https://code.visualstudio.com/download)
-- Install and run it
-  - On the local terminal, type `code`
-- Install [Remote Development Tools](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack)
-- Install [Remote: SSH](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh)
-- If you have Windows, you need WSL as explained on the email.
-
----
-
-## VSCode
-
-### Now with the remote explorer tab
-![](images/vscode-welcome.png)
-
-
----
-
-#### SSH
-- SSH is a secure shell (terminal) connection to another computer
-- You connect from your computer to the LOGIN NODE
-- Security is given by public/private keys
-- A connection to the supercomputer needs a
-  1. Key,
-  2. Configuration
-  3. Key/IP address known to the supercomputer
-
----
-
-### SSH
-
-#### Create key in VSCode's Terminal (menu View->Terminal)
-
-```bash
-mkdir ~/.ssh/
-ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519-JSC
-```
-
-```bash
-$ ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519-JSC
-Generating public/private ed25519 key pair. 
-Enter passphrase (empty for no passphrase): -Enter same passphrase again: -Your identification has been saved in /Users/strube1/.ssh/id_ed25519-JSC -Your public key has been saved in /Users/strube1/.ssh/id_ed25519-JSC.pub -The key fingerprint is: -SHA256:EGNNC1NTaN8fHwpfuZRPa50qXHmGcQjxp0JuU0ZA86U strube1@Strube-16 -The keys randomart image is: -+--[ED25519 256]--+ -| *++oo=o. . | -| . =+o .= o | -| .... o.E..o| -| . +.+o+B.| -| S =o.o+B| -| . o*.B+| -| . . = | -| o . | -| . | -+----[SHA256]-----+ -``` - ---- - -### SSH - -#### Configure SSH session - -```bash -code $HOME/.ssh/config -``` - -Windows users, from Ubuntu WSL -(Change username for your user on windows) - -```bash -ls -la /mnt/c/Users/ -mkdir /mnt/c/Users/USERNAME/.ssh/ -cp $HOME/.ssh/* /mnt/c/Users/USERNAME/.ssh/ -``` - - ---- - -### SSH - -#### Configure SSH session - -```bash -Host jureca - HostName jureca.fz-juelich.de - User [MY_USERNAME] # Here goes your username, not the word MY_USERNAME. - AddressFamily inet - IdentityFile ~/.ssh/id_ed25519-JSC - MACs hmac-sha2-512-etm@openssh.com -``` - -Copy contents to the config file and save it - -**REPLACE [MY_USERNAME] WITH YOUR USERNAME!!! 🤦‍♂️** - ---- - -### SSH - -#### JSC restricts from where you can login -#### So we need to: -1. Find our ip range -2. Add the range and key to [Judoor](https://judoor.fz-juelich.de) - ---- - -### SSH - -#### Find your ip/name range - -Open **[https://www.whatismyip.com](https://www.whatismyip.com)** - ---- - -### SSH - -#### Find your ip/name range - -![](images/whatismyip.png) - -- Let's keep this inside vscode: `code key.txt` and paste the number you got - ---- - -### SSH - -Did everyone get their **own** ip address? 
- ---- - -### SSH - EXAMPLE - -- I will use the number `93.199.55.163` -- **YOUR NUMBER IS DIFFERENT** -- Seriously - ---- - -### SSH - Example: `93.199.55.163` - -- Go to VSCode and make it simpler, replace the 2nd half with `"0.0/16"`: - - It was `93.199.55.163` - - Becomes `93.199.0.0/16` (with YOUR number, not with the example) -- Add a `from=""` around it -- So, it looks like this, now: `from="93.199.0.0/16"` -- Add a second magic number, with a comma: `,10.0.0.0/8` 🧙‍♀️ -- I promise, the magic is worth it 🧝‍♂️ (If time allows) -- In the end it looks like this: `from="93.199.0.0/16,10.0.0.0/8"` 🎬 -- Keep it open, we will use it later -- If you are from FZJ, also add "134.94.0.0/16" with a comma - ---- - -### SSH - Example: `93.199.0.0/16` - -#### Copy your ssh key -- Terminal: `code ~/.ssh/id_ed25519-JSC.pub` -- Something like this will open: - -- ```bash -ssh-ed25519 AAAAC3NzaC1lZDE1NTA4AAAAIHaoOJF3gqXd7CV6wncoob0DL2OJNfvjgnHLKEniHV6F strube@demonstration.fz-juelich.de -``` - -- Paste this line at the same `key.txt` which you just opened - ---- - -### SSH - -#### Example: `93.199.0.0/16` - -- Put them together and copy again: -- ```bash -from="93.199.0.0/16,10.0.0.0/8" ssh-ed25519 AAAAC3NzaC1lZDE1NTA4AAAAIHaoOJF3gqXd7CV6wncoob0DL2OJNfvjgnHLKEniHV6F strube@demonstration.fz-juelich.de -``` - ---- - -### SSH - -- Let's add it on [Judoor](https://judoor.fz-juelich.de) -- ![](images/manage-ssh-keys.png) -- Do it for JURECA and JUDAC with the same key - ---- - -### SSH - -#### Add new key to [Judoor](https://judoor.fz-juelich.de) - -![](images/manage-ssh-keys-from-and-key.png){ width=850px } - -This might take some minutes - ---- - -### SSH: Exercise - -That's it! Give it a try (and answer yes) +## Working with the supercomputer's software -```bash -$ ssh jureca -The authenticity of host 'jrlogin03.fz-juelich.de (134.94.0.185)' cannot be established. -ED25519 key fingerprint is SHA256:ASeu9MJbkFx3kL1FWrysz6+paaznGenChgEkUW8nRQU. 
-This key is not known by any other names
-Are you sure you want to continue connecting (yes/no/[fingerprint])? Yes
-**************************************************************************
-* Welcome to Jureca DC *
-**************************************************************************
-...
-...
-strube1@jrlogin03~ $
-```
+## Working with the supercomputer's software
+
+- We have literally thousands of software packages, hand-compiled for the specifics of the supercomputer.
+- [Full list](https://www.fz-juelich.de/en/ias/jsc/services/user-support/using-systems/software)
+- [Detailed documentation](https://apps.fz-juelich.de/jsc/hps/jureca/software-modules.html)

---

-### SSH: Exercise
-#### Make sure you are connected to the supercomputer
-
-```bash
-# Create a folder for myself
-mkdir $PROJECT_training2425/$USER
-
-# Create a shortcut for the project on the home folder
-rm -rf ~/course ; ln -s $PROJECT_training2425/$USER ~/course
-
-# Enter course folder and
-cd ~/course
-
-# Where am I?
-pwd
-
-# We well need those later
-mkdir ~/course/.cache
-mkdir ~/course/.config
-mkdir ~/course/.fastai
-rm -rf $HOME/.cache ; ln -s ~/course/.cache $HOME/
-rm -rf $HOME/.config ; ln -s ~/course/.config $HOME/
-rm -rf $HOME/.fastai ; ln -s ~/course/.fastai $HOME/
-```
-
----
+## Launcher in Jupyter-JSC
+![](images/launcher-jupyter-jsc.png)
+
+## Software
+
+### Connect to terminal
+
+![](images/jupyter-terminal.png)
+
+---
+
-## Working with the supercomputer's software
-
-- We have literally thousands of software packages, hand-compiled for the specifics of the supercomputer. 
-- [Full list](https://www.fz-juelich.de/en/ias/jsc/services/user-support/using-systems/software)
-- [Detailed documentation](https://apps.fz-juelich.de/jsc/hps/jureca/software-modules.html)

---

-## Software
-
-#### Tool for finding software: `module spider`
+### Tool for finding software: `module spider`

```bash
strube1$ module spider PyTorch
@@ -644,49 +401,25 @@ The following modules match your search criteria: "toml"
```

---

-## VSCode
-#### Editing files on the supercomputers
+### How to run it on the login node

-![](images/vscode-remotes.png)
+#### Create a python file
+![](images/open-new-file-jp.png)

---

-## VSCode
-
-![](images/vscode-jusuf.png)
+#### Create a python file
+![](images/rename-matrix-python-file.png)

---

-## VSCode
-
-- You can have a terminal inside VSCode:
-    - Go to the menu View->Terminal
-
----
-
-## VSCode
-
-- From the VSCode's terminal, navigate to your "course" folder and to the name you created earlier.
-
-- ```bash
-cd $HOME/course/
-pwd
-```
-
-- This is out working directory. We do everything here.
+#### Create a python file
+![](images/open-editor-matrix-python.png)

---

-### Demo code
-#### Create a new file "`matrix.py`" on VSCode on Jureca DC
-
-```bash
-code matrix.py
-```
-
-Paste this into the file:
-
-``` {.python .number-lines}
+#### Create a python file
+``` {.python .number-lines}
import torch

matrix1 = torch.randn(3,3)
@@ -701,8 +434,12 @@ print("The result is:\n", result)

---

-### How to run it on the login node
+#### Create a python file
+![](images/create-python-file.png)
+---
+
+#### Run the code on the login node
```
module load Stages/2023
module load GCC OpenMPI PyTorch
@@ -738,11 +475,11 @@ Simple Linux Utility for Resource Management

### Slurm submission file example

-`code jureca-matrix.sbatch`
+Create a file named `jureca-matrix.sbatch` as described in the previous section, and copy the following content into it. 
``` {.bash .number-lines}
#!/bin/bash
-#SBATCH --account=training2425 # Who pays?
+#SBATCH --account=training2441 # Who pays?
#SBATCH --nodes=1 # How many compute nodes
#SBATCH --job-name=matrix-multiplication
#SBATCH --ntasks-per-node=1 # How many mpi processes/node
@@ -751,7 +488,7 @@
#SBATCH --error=error.%j
#SBATCH --time=00:01:00 # For how long can it run?
#SBATCH --partition=dc-gpu # Machine partition
-#SBATCH --reservation=training2425 # For today only
+#SBATCH --reservation=training2441 # For today only

module load Stages/2024
module load GCC OpenMPI PyTorch # Load the correct modules on the compute node(s)
@@ -800,7 +537,7 @@ squeue --me

### Reservations

- Some partitions have reservations, which means that only certain users can use them at certain times.
-- For this course, it's called `training2441`
+- For this course, it's called `training2441`

---

@@ -816,13 +553,7 @@ scancel

#### By now you should have output and error log files on your directory. Check them!

-```bash
-# Notice that this number is the job id. It's different for every job
-cat output.412169
-cat error.412169
-```
-
-Or simply open it on VSCode!
+Simply open `output.412169` and `error.412169` in the editor! The number is the job id; it is different for every job. 
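---

### Finding the newest log files

When several jobs have run in the same directory, picking the right `output.<jobid>`/`error.<jobid>` pair by hand gets tedious. As a small illustrative sketch only (the `latest_job_logs` helper below is ours, not part of the course material), you can locate the newest pair programmatically:

```python
from pathlib import Path

def latest_job_logs(directory="."):
    """Return the most recently modified (output, error) log pair.

    Slurm writes one output.<jobid> and one error.<jobid> file per job,
    so we sort the output files by modification time and take the newest.
    """
    outputs = sorted(Path(directory).glob("output.*"),
                     key=lambda p: p.stat().st_mtime)
    if not outputs:
        return None
    newest = outputs[-1]
    job_id = newest.suffix.lstrip(".")  # e.g. "412169"
    return newest, Path(directory) / f"error.{job_id}"

if __name__ == "__main__":
    logs = latest_job_logs()
    if logs:
        for log in logs:
            print(f"--- {log} ---")
            print(log.read_text())
```

Run it from the directory where `sbatch` wrote the logs; it prints the newest pair.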
--- @@ -932,7 +663,7 @@ code fastai.sbatch ```bash #!/bin/bash -#SBATCH --account=training2425 +#SBATCH --account=training2441 #SBATCH --mail-user=MYUSER@fz-juelich.de #SBATCH --mail-type=ALL #SBATCH --nodes=1 @@ -943,7 +674,7 @@ code fastai.sbatch #SBATCH --error=error.%j #SBATCH --time=00:20:00 #SBATCH --partition=dc-gpu -#SBATCH --reservation=training2425 # For today only +#SBATCH --reservation=training2441 # For today only cd $HOME/course/ source sc_venv_template/activate.sh # Now we finally use the fastai module @@ -996,7 +727,7 @@ The following modules were not unloaded: - If you run it longer, you will get the actual error: - ```python Traceback (most recent call last): - File "/p/project/training2425/strube1/cats.py", line 5, in + File "/p/project/training2441/strube1/cats.py", line 5, in path = untar_data(URLs.PETS)/'images' ... ... @@ -1159,7 +890,7 @@ A tunnel which exposes the supercomputer's port 3000 as port 1234 locally](image --- -## Port forwarding demo: + ### Tensorboard on Jureca DC @@ -1196,164 +927,3 @@ As of now, I expect you managed to: ## ANY QUESTIONS?? #### Feedback is more than welcome! - ---- - -### Helmholtz Blablador - -![](images/blablador.png) - ---- - -### Blablador - -- Blablador is our Large Language Model inference server (eg. ChatGPT) -- It's a service for the Helmholtz Association. - - It's fast, free and PRIVATE - I don't record your conversations! 
-- Anyone here can use it - ---- - -### Blablador - -![https://helmholtz-blablador.fz-juelich.de](images/blablador-qrcode.png){width=500px} - ---- - -## VScode + Continue.dev - -![](images/continue-ask-code.png) - ---- - -### Obtaining a token - -- Go to helmholtz codebase at [http://codebase.helmholtz.cloud](http://codebase.helmholtz.cloud) -- Log in with your email -- On the left side, click on your profile, and then on "Preferences" -- On "Access tokens", click "Add new token", - - give it a name, - - put an expiration date (max 1 year) - - and choose "api" in the "scopes" section -- Click "Create Personal Access Token" - - You will see a "............................." - copy this and save somewhere. - ---- - -### Blablador - -![](images/blablador-api-scope.png){width=800px} - ---- - -### Blablador on VSCode! - -- Add [continue.dev](https://marketplace.visualstudio.com/items?itemName=Continue.continue) extension to VSCode -- On Continue, choose to add model, choose Other OpenAI-compatible API -- Click in Open Config.json at the end - ---- - -## Blablador: VScode + Continue.dev - -- Inside config.json, add at the `"models"` section: - -- ```json - { - "title": "Mistral helmholtz", - "provider": "openai", - "contextLength": 16384, - "model": "alias-code", - "apiKey": "ADD-YOUR-TOKEN-HERE", - "apiBase": "https://helmholtz-blablador.fz-juelich.de:8000" - }, -``` - -- REPLACE THE APIKEY WITH YOUR OWN TOKEN!!!! - ---- - -### Blablador on VSCode - -- Click on the "Continue.dev extension on the left side of VSCode. -- Select some code from our exercises, select it and send it to continue with cmd-shift-L (or ctrl-shift-L) -- Ask it to add unit tests, for example. - ---- - -## Backup slides - ---- - -## There's more! - -- Remember the magic? 🧙‍♂️ -- Let's use it now to access the compute nodes directly! 
- ---- - -## Proxy Jump - -#### Accessing compute nodes directly - -- If we need to access some ports on the compute nodes -- ![](images/proxyjump-magic.svg) - ---- - -## Proxy Jump - SSH Configuration - -Type on your machine "`code $HOME/.ssh/config`" and paste this at the end: - -```ssh - -# -- Compute Nodes -- -Host *.jureca - User [ADD YOUR USERNAME HERE] - StrictHostKeyChecking no - IdentityFile ~/.ssh/id_ed25519-JSC - ProxyJump jureca -``` - ---- - -## Proxy Jump: Connecting to a node - -Example: A service provides web interface on port 9999 - -On the supercomputer: - -```bash -srun --time=00:05:00 \ - --nodes=1 --ntasks=1 \ - --partition=dc-gpu \ - --account training2425 \ - --cpu_bind=none \ - --pty /bin/bash -i - -bash-4.4$ hostname # This is running on a compute node of the supercomputer -jwb0002 - -bash-4.4$ cd $HOME/course/ -bash-4.4$ source sc_venv_template/activate.sh -bash-4.4$ tensorboard --logdir=runs --port=9999 serve -``` - ---- - -## Proxy Jump - -On your machine: - -- ```bash -ssh -L :3334:localhost:9999 jrc002i.jureca -``` - -- Mind the `i` letter I added at the end of the hostname - -- Now you can access the service on your local browser at [http://localhost:3334](http://localhost:3334) - ---- - -### Now that's really the end! 😓 - diff --git a/03-parallelize-training.md b/02-parallelize-training.md similarity index 80% rename from 03-parallelize-training.md rename to 02-parallelize-training.md index 5a65bc3..77b7672 100644 --- a/03-parallelize-training.md +++ b/02-parallelize-training.md @@ -1,8 +1,36 @@ --- -author: Alexandre Strube // Sabrina Benassou +author: Alexandre Strube // Sabrina Benassou // Javad Kasravi title: Bringing Deep Learning Workloads to JSC supercomputers subtitle: Parallelize Training -date: June 25, 2024 +date: November 19, 2024 + +--- + +## Good practice + +- Always store your code in the project folder. 
In our case
+- ```bash
+/p/project/training2441/$USER
+```
+
+- Store data in the scratch directory for faster I/O access. Files in scratch are deleted after 90 days of inactivity.
+- ```bash
+/p/scratch/training2441/$USER
+```
+
+- Store the data in `$DATA_datasets` for a more permanent location. This location is not accessible by compute nodes.
+You have to join the [project](https://judoor.fz-juelich.de/projects/datasets/) in order to store and access data there.
+
+
+---
+
+## We need to download some code
+
+```bash
+cd $HOME/course
+git clone https://github.com/HelmholtzAI-FZJ/2024-11-course-deep-learning-in-neuroscience
+```
+

---

## The ResNet50 Model
@@ -10,6 +38,17 @@ date: June 25, 2024

---

+
+## The ImageNet dataset
+#### Large Scale Visual Recognition Challenge (ILSVRC)
+- An image dataset organized according to the [WordNet hierarchy](https://wordnet.princeton.edu).
+- Extensively used in algorithms for object detection and image classification at large scale.
+- It has 1000 classes, comprising 1.2 million training images and 50,000 validation images.
+
+![](images/imagenet_banner.jpeg)
+
+---
+
## ImageNet class

```python
@@ -20,8 +59,8 @@ class ImageNet(Dataset):

        self.root = root

-        with open(os.path.join(root, "imagenet_{}.json".format(split)), "rb") as f:
-            data = json.load(f)
+        with open(os.path.join(root, "imagenet_{}.pk".format(split)), "rb") as f:
+            data = pickle.load(f)

        self.samples = list(data.keys())
        self.targets = list(data.values())
@@ -74,7 +113,8 @@ class ImageNetDataModule(pl.LightningDataModule):
class resnet50Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
-        self.model = resnet50(pretrained=True)
+        weights = ResNet50_Weights.DEFAULT
+        self.model = resnet50(weights=weights)

    def forward(self, x):
        return self.model(x)
@@ -103,7 +143,7 @@ transform = transforms.Compose([
])

# 1. 
Organize the data
-datamodule = ImageNetDataModule("/p/scratch/training2425/data/", 256, \
+datamodule = ImageNetDataModule("/p/scratch/training2441/", 256, \
                                int(os.getenv('SLURM_CPUS_PER_TASK')), transform)
# 2. Build the model using desired Task
model = resnet50Model()
@@ -124,13 +164,13 @@ trainer.save_checkpoint("image_classification_model.pt")
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
-#SBATCH --cpus-per-task=96
+#SBATCH --cpus-per-task=128
#SBATCH --time=06:00:00
#SBATCH --partition=dc-gpu
-#SBATCH --account=training2425
+#SBATCH --account=training2441
#SBATCH --output=%j.out
#SBATCH --error=%j.err
-#SBATCH --reservation=training2425
+#SBATCH --reservation=training2441

# To get number of cpu per task
export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"
@@ -152,10 +192,69 @@ real 342m11.864s

## But what about many GPUs?

+::: {.container}
+:::: {.col}
+
+
+
+
+
+- We make use of the GPUs of our supercomputer and distribute our training to make it faster.
- It's when things get interesting
+::::
+:::: {.col}
+![](images/GPUs.svg)
+::::
+:::

+---
+
+## Distributed Training
+
+
+- Parallelizes the training across multiple nodes.
+- Significantly enhances training speed and model accuracy.
+- It is particularly beneficial for large models and computationally intensive tasks, such as deep learning.[[1]](https://pytorch.org/tutorials/distributed/home.html)
+
+
+---
+
+
+
+
+
+
+
+---
+
+
+
+
+## Parallel Training with PyTorch DDP
+
+- [PyTorch's DDP (Distributed Data Parallel)](https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html) works as follows:
+    - Each GPU across each node gets its own process.
+    - Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.
+    - Each process inits the model.
+    - Each process performs a full forward and backward pass in parallel.
+    - The gradients are synced and averaged across all processes.
+    - Each process updates its optimizer. 
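---

### Gradient averaging, sketched

The "synced and averaged" step above is an all-reduce. As a toy illustration only (plain Python lists standing in for per-GPU gradient tensors, no real NCCL or torch collectives involved), the averaging that DDP performs looks like this:

```python
def allreduce_mean(per_process_grads):
    """Toy stand-in for DDP's gradient all-reduce: after the collective,
    every process holds the element-wise mean of all gradients."""
    n = len(per_process_grads)
    length = len(per_process_grads[0])
    mean = [sum(g[i] for g in per_process_grads) / n for i in range(length)]
    # Every process receives the same averaged gradient.
    return [list(mean) for _ in range(n)]

# 4 "processes" (GPUs), each with a gradient computed on its own data shard
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(allreduce_mean(grads)[0])  # [4.0, 5.0]
```

Because every process ends up with identical gradients, each local optimizer step keeps all model replicas in sync.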
+
+---
+
## Data Parallel

![](images/data-parallel.svg)
@@ -183,13 +282,13 @@ real 342m11.864s
#SBATCH --nodes=1
#SBATCH --gres=gpu:4 # Use the 4 GPUs available
#SBATCH --ntasks-per-node=4 # When using pl it should always be set to 4
-#SBATCH --cpus-per-task=24 # Divide the number of cpus (96) by the number of GPUs (4)
+#SBATCH --cpus-per-task=32 # Divide the number of cpus (128) by the number of GPUs (4)
#SBATCH --time=02:00:00
#SBATCH --partition=dc-gpu
-#SBATCH --account=training2425
+#SBATCH --account=training2441
#SBATCH --output=%j.out
#SBATCH --error=%j.err
-#SBATCH --reservation=training2425
+#SBATCH --reservation=training2441

export CUDA_VISIBLE_DEVICES=0,1,2,3 # Very important to make the GPUs visible
export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"
@@ -236,6 +335,163 @@ real 89m15.923s

---

+## DDP steps
+
+1. Set up the environment variables for the distributed mode (WORLD_SIZE, RANK, LOCAL_RANK ...)
+
+- ```python
+# The number of total processes started by Slurm.
+ntasks = os.getenv('SLURM_NTASKS')
+# Index of the current process.
+rank = os.getenv('SLURM_PROCID')
+# Index of the current process on this node only.
+local_rank = os.getenv('SLURM_LOCALID')
+# The number of nodes
+nnodes = os.getenv("SLURM_NNODES")
+```
+
+---
+
+## DDP steps
+
+2. Initialize a sampler to specify the sequence of indices/keys used in data loading.
+3. Implement data parallelism of the model.
+4. Allow only one process to save checkpoints.
+
+- ```python
+datamodule = ImageNetDataModule("/p/scratch/training2441/", 256, \
+                                int(os.getenv('SLURM_CPUS_PER_TASK')), transform)
+trainer = pl.Trainer(max_epochs=10, accelerator="gpu", num_nodes=nnodes)
+trainer.fit(model, datamodule=datamodule)
+trainer.save_checkpoint("image_classification_model.pt")
+```
+
+---
+
+## Multi-Node training
+
+```python
+transform = transforms.Compose([
+    transforms.ToTensor(),
+    transforms.Resize((256, 256))
+])
+
+# 1. The number of nodes
+nnodes = os.getenv("SLURM_NNODES")
+# 2. 
Organize the data +datamodule = ImageNetDataModule("/p/scratch/training2441/", 128, \ + int(os.getenv('SLURM_CPUS_PER_TASK')), transform) +# 3. Build the model using desired Task +model = resnet50Model() +# 4. Create the trainer +trainer = pl.Trainer(max_epochs=10, accelerator="gpu", num_nodes=nnodes) +# 5. Train the model +trainer.fit(model, datamodule=datamodule) +# 6. Save the model! +trainer.save_checkpoint("image_classification_model.pt") +``` + +--- + +## Multi-Node training + +16 nodes and 4 GPU each + +```bash +#!/bin/bash -x +#SBATCH --nodes=16 # This needs to match Trainer(num_nodes=...) +#SBATCH --gres=gpu:4 # Use the 4 GPUs available +#SBATCH --ntasks-per-node=4 # When using pl it should always be set to 4 +#SBATCH --cpus-per-task=32 # Divide the number of cpus (128) by the number of GPUs (4) +#SBATCH --time=00:15:00 +#SBATCH --partition=dc-gpu +#SBATCH --account=training2441 +#SBATCH --output=%j.out +#SBATCH --error=%j.err +#SBATCH --reservation=training2441 + +export CUDA_VISIBLE_DEVICES=0,1,2,3 # Very important to make the GPUs visible +export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK" + +source $HOME/course/$USER/sc_venv_template/activate.sh +time srun python3 ddp_training.py +``` + +```bash +real 6m56.457s +``` + +--- + +## Multi-Node training + +With 4 nodes: + +```bash +real 24m48.169s +``` + +With 8 nodes: + +```bash +real 13m10.722s +``` + +With 16 nodes: + +```bash +real 6m56.457s +``` + +With 32 nodes: + +```bash +real 4m48.313s +``` +--- + +## Data Parallel + + + +- It was +- ```python +trainer = pl.Trainer(max_epochs=10, accelerator="gpu") +``` +- Became +- ```python +nnodes = os.getenv("SLURM_NNODES") +trainer = pl.Trainer(max_epochs=10, accelerator="gpu", num_nodes=nnodes) +``` + +--- + +## Data Parallel + + + +- It was +- ```bash +#SBATCH --nodes=1 +#SBATCH --gres=gpu:1 +#SBATCH --ntasks-per-node=1 +#SBATCH --cpus-per-task=128 +``` +- Became +- ```bash +#SBATCH --nodes=16 # This needs to match Trainer(num_nodes=...) 
+#SBATCH --gres=gpu:4 # Use the 4 GPUs available +#SBATCH --ntasks-per-node=4 # When using pl it should always be set to 4 +#SBATCH --cpus-per-task=32 # Divide the number of cpus (128) by the number of GPUs (4) +export CUDA_VISIBLE_DEVICES=0,1,2,3 # Very important to make the GPUs visible +``` + +--- + +## DEMO + +--- + ## Before we go further... - Data parallel is usually good enough 👌 @@ -431,187 +687,6 @@ real 89m15.923s --- - -## Parallel Training with PyTorch DDP - -- [PyTorch's DDP (Distributed Data Parallel)](https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html) works as follows: - - Each GPU across each node gets its own process. - - Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset. - - Each process inits the model. - - Each process performs a full forward and backward pass in parallel. - - The gradients are synced and averaged across all processes. - - Each process updates its optimizer. - ---- - - -## Terminologies - -- WORLD_SIZE: number of processes participating in the job. -- RANK: the rank of the process in the network. -- LOCAL_RANK: the rank of the process on the local machine. -- MASTER_PORT: free port on machine with rank 0. - - ---- - -## DDP steps - -1. Set up the environement variables for the distributed mode (WORLD_SIZE, RANK, LOCAL_RANK ...) - -- ```python -# The number of total processes started by Slurm. -ntasks = os.getenv('SLURM_NTASKS') -# Index of the current process. -rank = os.getenv('SLURM_PROCID') -# Index of the current process on this node only. -local_rank = os.getenv('SLURM_LOCALID') -# The number of nodes -nnodes = os.getenv("SLURM_NNODES") -``` - ---- - -## DDP steps - -2. Initialize a sampler to specify the sequence of indices/keys used in data loading. -3. Implements data parallelism of the model. -4. Allow only one process to save checkpoints. 
- -- ```python -datamodule = ImageNetDataModule("/p/scratch/training2425/data/", 256, \ - int(os.getenv('SLURM_CPUS_PER_TASK')), transform) -trainer = pl.Trainer(max_epochs=10, accelerator="gpu", num_nodes=nnodes) -trainer.fit(model, datamodule=datamodule) -trainer.save_checkpoint("image_classification_model.pt") -``` - ---- - -## DDP steps - -```python -transform = transforms.Compose([ - transforms.ToTensor(), - transforms.Resize((256, 256)) -]) - -# 1. The number of nodes -nnodes = os.getenv("SLURM_NNODES") -# 2. Organize the data -datamodule = ImageNetDataModule("/p/scratch/training2425/data/", 128, \ - int(os.getenv('SLURM_CPUS_PER_TASK')), transform) -# 3. Build the model using desired Task -model = resnet50Model() -# 4. Create the trainer -trainer = pl.Trainer(max_epochs=10, accelerator="gpu", num_nodes=nnodes) -# 5. Train the model -trainer.fit(model, datamodule=datamodule) -# 6. Save the model! -trainer.save_checkpoint("image_classification_model.pt") -``` - ---- - -## DDP training - -16 nodes and 4 GPU each - -```bash -#!/bin/bash -x -#SBATCH --nodes=16 # This needs to match Trainer(num_nodes=...) 
-#SBATCH --gres=gpu:4 # Use the 4 GPUs available -#SBATCH --ntasks-per-node=4 # When using pl it should always be set to 4 -#SBATCH --cpus-per-task=24 # Divide the number of cpus (96) by the number of GPUs (4) -#SBATCH --time=00:15:00 -#SBATCH --partition=dc-gpu -#SBATCH --account=training2425 -#SBATCH --output=%j.out -#SBATCH --error=%j.err -#SBATCH --reservation=training2425 - -export CUDA_VISIBLE_DEVICES=0,1,2,3 # Very important to make the GPUs visible -export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK" - -source $HOME/course/$USER/sc_venv_template/activate.sh -time srun python3 ddp_training.py -``` - -```bash -real 6m56.457s -``` - ---- - -## DDP training - -With 4 nodes: - -```bash -real 24m48.169s -``` - -With 8 nodes: - -```bash -real 13m10.722s -``` - -With 16 nodes: - -```bash -real 6m56.457s -``` - -With 32 nodes: - -```bash -real 4m48.313s -``` ---- - -## Data Parallel - - - -- It was -- ```python -trainer = pl.Trainer(max_epochs=10, accelerator="gpu") -``` -- Became -- ```python -nnodes = os.getenv("SLURM_NNODES") -trainer = pl.Trainer(max_epochs=10, accelerator="gpu", num_nodes=nnodes) -``` - ---- - -## Data Parallel - - - -- It was -- ```bash -#SBATCH --nodes=1 -#SBATCH --gres=gpu:1 -#SBATCH --ntasks-per-node=1 -#SBATCH --cpus-per-task=96 -``` -- Became -- ```bash -#SBATCH --nodes=16 # This needs to match Trainer(num_nodes=...) -#SBATCH --gres=gpu:4 # Use the 4 GPUs available -#SBATCH --ntasks-per-node=4 # When using pl it should always be set to 4 -#SBATCH --cpus-per-task=24 # Divide the number of cpus (96) by the number of GPUs (4) -export CUDA_VISIBLE_DEVICES=0,1,2,3 # Very important to make the GPUs visible -``` - ---- - -## DEMO - ---- - ## TensorBoard - In resnet50.py @@ -645,7 +720,6 @@ tensorboard --logdir=[PATH_TO_TENSOR_BOARD] ## DAY 2 RECAP -- Access using FS, Arrow, and H5 files - Ran parallel code. - Can submit single node, multi-gpu and multi-node training. - Use TensorBoard on the supercomputer. 
@@ -657,7 +731,7 @@ tensorboard --logdir=[PATH_TO_TENSOR_BOARD] #### Feedback is more than welcome! -#### Link to [other courses at JSC](https://go.fzj.de/intro-sc-ai-2023-other-courses) +#### Link to [other courses at JSC](https://go.fzj.de/dl-in-neuroscience-all-courses) --- diff --git a/02-speedup-data-loading.md b/02-speedup-data-loading.md deleted file mode 100644 index 8d9d79b..0000000 --- a/02-speedup-data-loading.md +++ /dev/null @@ -1,444 +0,0 @@ ---- -author: Alexandre Strube // Sabrina Benassou -title: Bringing Deep Learning Workloads to JSC supercomputers -subtitle: Data loading -date: June 25, 2024 ---- - -### Schedule for day 2 - -| Time | Title | -| ------------- | ----------- | -| 10:00 - 10:15 | Welcome, questions | -| 10:15 - 11:30 | Data loading | -| 11:30 - 12:00 | Coffee Break (flexible) | -| 12:30 - 14:00 | Parallelize Training | - ---- - -## Let's talk about DATA - -- Some general considerations one should have in mind - ---- - -![Not this data](images/data-and-lore.jpg) - ---- - -## I/O is separate and shared - -#### All compute nodes of all supercomputers see the same files - -- Performance tradeoff between shared acessibility and speed -- It's simple to load data fast to 1 or 2 gpus. But to 100? 1000? 10000? - ---- - -### Jülich Supercomputers - -- Our I/O server is almost a supercomputer by itself -- ![JSC Supercomputer Stragegy](images/machines.png) - ---- - -## Where do I keep my files? - -- **`$PROJECT_projectname`** for code (`projectname` is `training2425` in this case) - - Most of your work should stay here -- **`$DATA_projectname`** for big data(*) - - Permanent location for big datasets -- **`$SCRATCH_projectname`** for temporary files (fast, but not permanent) - - Files are deleted after 90 days untouched - ---- - -## Data services - -- JSC provides different data services -- Data projects give massive amounts of storage -- We use it for ML datasets. 
Join the project at **[Judoor](https://judoor.fz-juelich.de/projects/join/datasets)** -- After being approved, connect to the supercomputer and try it: -- ```bash -cd $DATA_datasets -ls -la -``` - ---- - -## Data Staging - -- [LARGEDATA filesystem](https://apps.fz-juelich.de/jsc/hps/juwels/filesystems.html) is not accessible by compute nodes - - Copy files to an accessible filesystem BEFORE working -- Imagenet-21K copy alone takes 21+ minutes to $SCRATCH - - We already copied it to $SCRATCH for you - ---- - -## Data loading - -![Fat GPUs need to be fed FAST](images/nomnom.jpg) - ---- - -## Strategies - -- We have CPUs and lots of memory - let's use them - - multitask training and data loading for the next batch - - `/dev/shm` is a filesystem on ram - ultra fast ⚡️ -- Use big files made for parallel computing - - HDF5, Zarr, mmap() in a parallel fs, LMDB -- Use specialized data loading libraries - - FFCV, DALI, Apache Arrow -- Compression sush as squashfs - - data transfer can be slower than decompression (must be checked case by case) - - Beneficial in cases where numerous small files are at hand. - ---- - -## Libraries - -- Apache Arrow [https://arrow.apache.org/](https://arrow.apache.org/) -- FFCV [https://github.com/libffcv/ffcv](https://github.com/libffcv/ffcv) and [FFCV for PyTorch-Lightning](https://github.com/SerezD/ffcv_pytorch_lightning) -- Nvidia's DALI [https://developer.nvidia.com/dali](https://developer.nvidia.com/dali) - ---- - -## We need to download some code - -```bash -cd $HOME/course -git clone https://github.com/HelmholtzAI-FZJ/2024-06-course-Bringing-Deep-Learning-Workloads-to-JSC-supercomputers.git -``` - ---- - -## The ImageNet dataset -#### Large Scale Visual Recognition Challenge (ILSVRC) -- An image dataset organized according to the [WordNet hierarchy](https://wordnet.princeton.edu). -- Extensively used in algorithms for object detection and image classification at large scale. 
-- It has 1000 classes, that comprises 1.2 million images for training, and 50,000 images for the validation set. - -![](images/imagenet_banner.jpeg) - ---- - -## The ImageNet dataset - -```bash -ILSVRC -|-- Data/ - `-- CLS-LOC - |-- test - |-- train - | |-- n01440764 - | | |-- n01440764_10026.JPEG - | | |-- n01440764_10027.JPEG - | | |-- n01440764_10029.JPEG - | |-- n01695060 - | | |-- n01695060_10009.JPEG - | | |-- n01695060_10022.JPEG - | | |-- n01695060_10028.JPEG - | | |-- ... - | |... - |-- val - |-- ILSVRC2012_val_00000001.JPEG - |-- ILSVRC2012_val_00016668.JPEG - |-- ILSVRC2012_val_00033335.JPEG - |-- ... -``` ---- - -## The ImageNet dataset -imagenet_train.json - -```bash -{ - 'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_8050.JPEG': 524, - 'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_12728.JPEG': 524, - 'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_9736.JPEG': 524, - ... - 'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_7460.JPEG': 524, - ... - } -``` - -imagenet_val.json - -```bash -{ - 'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00008838.JPEG': 785, - 'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00008555.JPEG': 129, - 'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00028410.JPEG': 968, - ... - 'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00016007.JPEG': 709, - } -``` - ---- - -## Access File System - -```python -def __getitem__(self, idx): - x = Image.open(os.path.join(self.root, self.samples[idx])).convert("RGB") - if self.transform: - x = self.transform(x) - return x, self.targets[idx] - -``` - ---- - -## Inodes -- Inodes (Index Nodes) are data structures that store metadata about files and directories. -- Unique identification of files and directories within the file system. -- Efficient management and retrieval of file metadata. -- Essential for file operations like opening, reading, and writing. -- **Limitations**: - - **Fixed Number**: Limited number of inodes; no new files if exhausted, even with free disk space. 
- - **Space Consumption**: Inodes consume disk space, balancing is needed for efficiency. -![](images/inodes.png) - ---- - -## Pyarrow File Creation - -![](images/field.png) - -```python - binary_t = pa.binary() - uint16_t = pa.uint16() -``` - ---- - -## Pyarrow File Creation - -![](images/schema.png) - -```python - binary_t = pa.binary() - uint16_t = pa.uint16() - - schema = pa.schema([ - pa.field('image_data', binary_t), - pa.field('label', uint16_t), - ]) -``` - ---- - -## Pyarrow File Creation - -![](images/file.png){width=700 height=350} - -```python - with pa.OSFile( - os.path.join(args.target_folder, f'ImageNet-{split}.arrow'), - 'wb', - ) as f: - with pa.ipc.new_file(f, schema) as writer: -``` - ---- - -## Pyarrow File Creation - -![](images/batch.png){width=650 height=300} - -```python - - with open(sample, 'rb') as f: - img_string = f.read() - - image_data = pa.array([img_string], type=binary_t) - label = pa.array([label], type=uint16_t) - - batch = pa.record_batch([image_data, label], schema=schema) - - writer.write(batch) -``` - ---- - -## Pyarrow File Creation - -![](images/pyarrow.png){width=650 height=300} - -```python - - with open(sample, 'rb') as f: - img_string = f.read() - - image_data = pa.array([img_string], type=binary_t) - label = pa.array([label], type=uint16_t) - - batch = pa.record_batch([image_data, label], schema=schema) - - writer.write(batch) -``` - ---- - -## Access Arrow File - -::: {.container} -:::: {.col} -![](images/pyarrow.png){width=500 height=300} -:::: -:::: {.col} -```python -def __getitem__(self, idx): - if self.arrowfile is None: - self.arrowfile = pa.OSFile(self.data_root, 'rb') - self.reader = pa.ipc.open_file(self.arrowfile) - - row = self.reader.get_batch(idx) - - img_string = row['image_data'][0].as_py() - target = row['label'][0].as_py() - - with io.BytesIO(img_string) as byte_stream: - with Image.open(byte_stream) as img: - img = img.convert("RGB") - - if self.transform: - img = self.transform(img) - - return img, 
target - -``` -:::: -::: - ---- - -## HDF5 - -![](images/h5.png) - -```python - -with h5py.File(os.path.join(args.target_folder, 'ImageNet.h5'), "w") as f: - -``` - ---- - -## HDF5 - -::: {.container} -:::: {.col} -```python - -group = g.create_group(split) - -``` -:::: -:::: {.col} -![](images/groups.png) -:::: -::: - ---- - -## HDF5 - - -::: {.container} -:::: {.col} -``` python -dt_sample = h5py.vlen_dtype(np.dtype(np.uint8)) -dt_target = np.dtype('int16') - -dset = group.create_dataset( - 'images', - (len(samples),), - dtype=dt_sample, - ) - -dtargets = group.create_dataset( - 'targets', - (len(samples),), - dtype=dt_target, - ) -``` -:::: -:::: {.col} -![](images/datasets.png){width=400 height=350} -:::: -::: - ---- - -## HDF5 - - -![](images/first_iter.png){width=750 height=350} - -```python -for idx, (sample, target) in tqdm(enumerate(zip(samples, targets))): - with open(sample, 'rb') as f: - img_string = f.read() - dset[idx] = np.array(list(img_string), dtype=np.uint8) - dtargets[idx] = target -``` - ---- - -## HDF5 - - -![](images/last_iter.png){width=750 height=350} - -```python -for idx, (sample, target) in tqdm(enumerate(zip(samples, targets))): - with open(sample, 'rb') as f: - img_string = f.read() - dset[idx] = np.array(list(img_string), dtype=np.uint8) - dtargets[idx] = target -``` - ---- - -## HDF5 - - -![](images/hdf5.png) - ---- - -## Access h5 File - -```python -def __getitem__(self, idx): - if self.h5file is None: - self.h5file = h5py.File(self.train_data_path, 'r')[self.split] - self.imgs = self.h5file["images"] - self.targets = self.h5file["targets"] - - img_string = self.imgs[idx] - target = self.targets[idx] - - with io.BytesIO(img_string) as byte_stream: - with Image.open(byte_stream) as img: - img = img.convert("RGB") - - if self.transform: - img = self.transform(img) - - return img, target -``` - ---- - -## DEMO - ---- - -## Exercise - -- Could you create an arrow file for the flickr dataset stored in 
-```/p/scratch/training2402/data/Flickr30K/``` -and read it using a dataloader ? \ No newline at end of file diff --git a/email-template.md b/email-template.md index c515cb8..b498c13 100644 --- a/email-template.md +++ b/email-template.md @@ -1,20 +1,16 @@ --- -author: Alexandre Strube // Sabrina Benassou -title: Course: Bringing Deep Learning Workloads to JSC supercomputers +author: Alexandre Strube // Sabrina Benassou // Javad Kasravi +title: Deep Learning in Neuroscience // on the Supercomputers of the Jülich Supercomputing Centre # subtitle: A primer in supercomputers` -date: June 25, 2024 +date: November 19, 2024 --- Dear students, the next "Bringing Deep Learning Workloads to JSC supercomputers" course is approaching! Thank you all very much for your participation. -The course is online, over zoom. It might be recorded. This is the link: -https://go.fzj.de/bringing-dl-workloads-to-jsc-zoom - - ********* -IMPORTANT - Please check all steps! Some things need to be done a day BEFORE the course!!! +IMPORTANT - Please check all steps! Some things need to be done some days BEFORE the course!!! ********* Checklist for BEFORE the course: @@ -22,43 +18,29 @@ Checklist for BEFORE the course: - If you don't have one, make an account on JuDOOR, our portal: https://judoor.fz-juelich.de/register Instruction video: https://drive.google.com/file/d/1-DfiNBP4Gta0av4lQmubkXIXzr2FW4a-/view -- Joining the course's project: https://go.fzj.de/bringing-dl-workloads-to-jsc-project-join +- Joining the course's project: https://go.fzj.de/dl-in-neuroscience-project-join - Sign the usage agreements, as shown in this video: https://drive.google.com/file/d/1mEN1GmWyGFp75uMIi4d6Tpek2NC_X8eY/view -- Install software (see below). On windows you DO need administrator rights. We can't support other softwares during the course. - -- We will use Slack for communication. 
Please log in BEFORE the course: https://go.fzj.de/bringing-dl-workloads-to-jsc-slack - - +If you did not complete the above checklist before the course, unfortunately, it will not be possible to use the supercomputers. --- What software is necessary for this course? The course is platform-independent. It can even be followed by a Windows user, but if possible, avoid it. In general. Forever. -- Visual Studio Code: it's a free editor which we will demo on this course. Get it from https://code.visualstudio.com/download - -- Visual Studio Code Remote Development: https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack - -- Visual Studio: Remote - SSH: https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh - -- (WINDOWS ONLY): WSL. This installs the WSL support for Visual Studio Code, which will install WSL itself (And Ubuntu). https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-wsl - This is a long install, take your time. - PLEASE MAKE SURE WSL IS ACTUALLY INSTALLED - Try running it. Check this example: https://pureinfotech.com/install-windows-subsystem-linux-2-windows-10/ - - A terminal. On Linux and Mac, it's just called "Terminal". Little familiarity with it is required. On windows, the WSL installs it. -- The `ssh` command. It's installed by default on Mac and Linux, and should be on Windows after the aforementioned steps. - - Some knowledge of the Python language. --- -The course material is available at https://go.fzj.de/bringing-dl-workloads-to-jsc - I will be making some final commits to it, so make sure you reload it every now and then. + +The course material is available at https://go.fzj.de/dl-in-neuroscience - I will be making some final commits to it, so make sure you reload it every now and then. 
See you soon, -Alex and Sabrina +Alex, Sabrina and Javad diff --git a/public/01-access-machines.html b/public/01-access-machines.html index 2d41bf7..1aaad8f 100644 --- a/public/01-access-machines.html +++ b/public/01-access-machines.html @@ -3,9 +3,9 @@ - - - Accessing the machines, intro + + + Deep Learning in Neuroscience // on the Supercomputers of the Jülich Supercomputing Centre @@ -43,7 +43,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } - pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } + pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -225,9 +225,11 @@
-

Accessing the machines, intro

-

Alexandre Strube // Sabrina Benassou

-

June 25, 2024

+

Deep Learning in Neuroscience // on the +Supercomputers of the Jülich Supercomputing Centre

+

Alexandre Strube // Sabrina Benassou // Javad +Kasravi

+

November 19, 2024

@@ -235,28 +237,22 @@

Communication:

Links for the complementary parts of this course:

Team:

+
+
+Javad Kasravi + +
+

-

Schedule for day 1

+

Schedule

@@ -311,39 +313,39 @@

Schedule for day 1

- + - + - + - + - - + + - + - + - + - + @@ -354,9 +356,9 @@

Schedule for day 1

Note

Please open this document on your own browser! We will need it for the exercises. https://go.fzj.de/bringing-dl-workloads-to-jsc

+href="https://go.fzj.de/dl-in-neuroscience">https://go.fzj.de/dl-in-neuroscience

-Mobile friendly, but you need it on your computer, really @@ -545,17 +547,17 @@

Connecting to Jureca DC

Getting compute time

  • Go to https://go.fzj.de/bringing-dl-workloads-to-jsc-project-join
  • +href="https://go.fzj.de/dl-in-neuroscience-project-join">https://go.fzj.de/dl-in-neuroscience-project-join
  • Join the course project -training2425
  • +training2441
  • Sign the Usage Agreements (Video)
  • Compute time allocation is based on compute projects. For every compute job, a compute project pays.
  • Time is measured in core-hours. One hour of Jureca -DC is 48 core-hours.
  • +DC is 128 core-hours.
  • Example: Job runs for 8 hours on 64 nodes of Jureca -DC: 8 * 64 * 48 = 24576 core-h!
  • +DC: 8 * 64 * 128 = 65536 core-h!
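The accounting rule above is a simple product; a minimal sketch (the 128 cores per Jureca DC node and the 8-hour, 64-node example are the numbers from the text):

```python
def core_hours(wall_hours, nodes, cores_per_node=128):
    # Accounting rule from above: wall-clock hours x nodes x cores per node
    return wall_hours * nodes * cores_per_node

print(core_hours(8, 64))  # 65536 core-h, matching the example above
```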
@@ -574,272 +576,8 @@

Jupyter

Jupyter

-

Pay -attention to the partition - DON’T RUN IT ON THE LOGIN NODE!!!

-
-

Connecting to Jureca DC

-
-
-

VSCode

- -
-
-

VSCode

-

Now with the remote explorer -tab

-

-
-
- -

SSH

-
    -
  • SSH is a secure shell (terminal) connection to -another computer
  • -
  • You connect from your computer to the LOGIN -NODE
  • -
  • Security is given by public/private keys
  • -
  • A connection to the supercomputer needs a -
      -
    1. Key,
    2. -
    3. Configuration
    4. -
    5. Key/IP address known to the supercomputer
    6. -
  • -
-
-
- -

SSH

-

Create key in -VSCode’s Terminal (menu View->Terminal)

-
mkdir ~/.ssh/
-ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519-JSC
-
$ ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519-JSC
-Generating public/private ed25519 key pair.
-Enter passphrase (empty for no passphrase): 
-Enter same passphrase again: 
-Your identification has been saved in /Users/strube1/.ssh/id_ed25519-JSC
-Your public key has been saved in /Users/strube1/.ssh/id_ed25519-JSC.pub
-The key fingerprint is:
-SHA256:EGNNC1NTaN8fHwpfuZRPa50qXHmGcQjxp0JuU0ZA86U strube1@Strube-16
-The keys randomart image is:
-+--[ED25519 256]--+
-|      *++oo=o. . |
-|     . =+o .= o  |
-|      .... o.E..o|
-|       .  +.+o+B.|
-|        S  =o.o+B|
-|          . o*.B+|
-|          . . =  |
-|           o .   |
-|            .    |
-+----[SHA256]-----+
-
-
- -

SSH

-

Configure SSH session

-
code $HOME/.ssh/config
-

Windows users, from Ubuntu WSL (Change username for your user on -windows)

-
ls -la /mnt/c/Users/
-mkdir /mnt/c/Users/USERNAME/.ssh/
-cp $HOME/.ssh/* /mnt/c/Users/USERNAME/.ssh/
-
-
- -

SSH

-

Configure SSH session

-
Host jureca
-        HostName jureca.fz-juelich.de
-        User [MY_USERNAME]   # Here goes your username, not the word MY_USERNAME.
-        AddressFamily inet
-        IdentityFile ~/.ssh/id_ed25519-JSC
-        MACs hmac-sha2-512-etm@openssh.com
-

Copy contents to the config file and save it

-

REPLACE [MY_USERNAME] WITH YOUR USERNAME!!! 🤦‍♂️

-
-
- -

SSH

-

JSC restricts from where -you can login

-

So we need to:

-
    -
  1. Find our ip range
  2. -
  3. Add the range and key to Judoor
  4. -
-
-
- -

SSH

-

Find your ip/name range

-

Open https://www.whatismyip.com

-
-
- -

SSH

-

Find your ip/name range

-

-
    -
  • Let’s keep this inside vscode: -code key.txt and paste the number you got
  • -
-
-
- -

SSH

-

Did everyone get their own ip address?

-
-
- -

SSH - EXAMPLE

-
    -
  • I will use the number -93.199.55.163
  • -
  • YOUR NUMBER IS DIFFERENT
  • -
  • Seriously
  • -
-
-
- -

SSH - Example: -93.199.55.163

-
    -
  • Go to VSCode and make it simpler, replace the 2nd -half with "0.0/16": -
      -
    • It was 93.199.55.163
    • -
    • Becomes 93.199.0.0/16 (with YOUR -number, not with the example)
    • -
  • -
  • Add a from="" around it
  • -
  • So, it looks like this, now: -from="93.199.0.0/16"
  • -
  • Add a second magic number, with a comma: -,10.0.0.0/8 🧙‍♀️
  • -
  • I promise, the magic is worth it 🧝‍♂️ (If time -allows)
  • -
  • In the end it looks like this: -from="93.199.0.0/16,10.0.0.0/8" 🎬
  • -
  • Keep it open, we will use it later
  • -
  • If you are from FZJ, also add “134.94.0.0/16” with -a comma
  • -
-
-
- -

SSH - Example: -93.199.0.0/16

-

Copy your ssh key

-
    -
  • Terminal: -code ~/.ssh/id_ed25519-JSC.pub

  • -
  • Something like this will open:

  • -
  • ssh-ed25519 AAAAC3NzaC1lZDE1NTA4AAAAIHaoOJF3gqXd7CV6wncoob0DL2OJNfvjgnHLKEniHV6F strube@demonstration.fz-juelich.de
  • -
  • Paste this line at the same key.txt -which you just opened

  • -
-
-
- -

SSH

-

Example: 93.199.0.0/16

-
    -
  • Put them together and copy again:
  • -
  • from="93.199.0.0/16,10.0.0.0/8" ssh-ed25519 AAAAC3NzaC1lZDE1NTA4AAAAIHaoOJF3gqXd7CV6wncoob0DL2OJNfvjgnHLKEniHV6F strube@demonstration.fz-juelich.de
  • -
-
-
- -

SSH

-
    -
  • Let’s add it on Judoor
  • -
  • -
  • Do it for JURECA and JUDAC with the same key
  • -
-
-
- -

SSH

-

Add new key to Judoor

-

-

This might take some minutes

-
-
- -

SSH: Exercise

-

That’s it! Give it a try (and answer yes)

-
$ ssh jureca
-The authenticity of host 'jrlogin03.fz-juelich.de (134.94.0.185)' cannot be established.
-ED25519 key fingerprint is SHA256:ASeu9MJbkFx3kL1FWrysz6+paaznGenChgEkUW8nRQU.
-This key is not known by any other names
-Are you sure you want to continue connecting (yes/no/[fingerprint])? Yes
-**************************************************************************
-*                            Welcome to Jureca DC                   *
-**************************************************************************
-...
-...
-strube1@jrlogin03~ $ 
-
-
- -

SSH: Exercise

-

Make sure you -are connected to the supercomputer

-
# Create a folder for myself
-mkdir $PROJECT_training2425/$USER
-
-# Create a shortcut for the project on the home folder
-rm -rf ~/course ; ln -s $PROJECT_training2425/$USER ~/course
-
-# Enter course folder and
-cd ~/course
-
-# Where am I?
-pwd
-
-# We well need those later
-mkdir ~/course/.cache
-mkdir ~/course/.config
-mkdir ~/course/.fastai
-
-rm -rf $HOME/.cache ; ln -s ~/course/.cache $HOME/
-rm -rf $HOME/.config ; ln -s ~/course/.config $HOME/
-rm -rf $HOME/.fastai ; ln -s ~/course/.fastai $HOME/
-

Working with the supercomputer’s software

@@ -854,27 +592,36 @@

Working with the supercomputer’s software

documentation
+
+

Launcher in Jupyter-JSC

+

+

Software

-

Tool for finding -software: module spider

-
strube1$ module spider PyTorch
-------------------------------------------------------------------------------------
-  PyTorch:
-------------------------------------------------------------------------------------
-    Description:
-      Tensors and Dynamic neural networks in Python with strong GPU acceleration. 
-      PyTorch is a deep learning framework that puts Python first.
-
-     Versions:
-        PyTorch/1.7.0-Python-3.8.5
-        PyTorch/1.8.1-Python-3.8.5
-        PyTorch/1.11-CUDA-11.5
-        PyTorch/1.12.0-CUDA-11.7
-     Other possible modules matches:
-        PyTorch-Geometric  PyTorch-Lightning
-...
+

Connect to terminal

+

+
+
+ +

Tool for finding +software: module spider

+
strube1$ module spider PyTorch
+------------------------------------------------------------------------------------
+  PyTorch:
+------------------------------------------------------------------------------------
+    Description:
+      Tensors and Dynamic neural networks in Python with strong GPU acceleration. 
+      PyTorch is a deep learning framework that puts Python first.
+
+     Versions:
+        PyTorch/1.7.0-Python-3.8.5
+        PyTorch/1.8.1-Python-3.8.5
+        PyTorch/1.11-CUDA-11.5
+        PyTorch/1.12.0-CUDA-11.7
+     Other possible modules matches:
+        PyTorch-Geometric  PyTorch-Lightning
+...

What do we have?

@@ -911,31 +658,31 @@

Example: PyTorch

Example: PyTorch

(make sure you are still connected to Jureca DC)

-
$ python
--bash: python: command not found
+
$ python
+-bash: python: command not found

Oh noes! 🙈

Let’s bring Python together with PyTorch!

Example: PyTorch

Copy and paste these lines

-
# This command fails, as we have no proper python
-python 
-# So, we load the correct modules...
-module load Stages/2024
-module load GCC OpenMPI Python PyTorch
-# And we run a small test: import pytorch and ask its version
-python -c "import torch ; print(torch.__version__)" 
+
# This command fails, as we have no proper python
+python 
+# So, we load the correct modules...
+module load Stages/2024
+module load GCC OpenMPI Python PyTorch
+# And we run a small test: import pytorch and ask its version
+python -c "import torch ; print(torch.__version__)" 

Should look like this:

-
$ python
--bash: python: command not found
-$ module load Stages/2024
-$ module load GCC OpenMPI Python PyTorch
-$ python -c "import torch ; print(torch.__version__)" 
-2.1.0
+
$ python
+-bash: python: command not found
+$ module load Stages/2024
+$ module load GCC OpenMPI Python PyTorch
+$ python -c "import torch ; print(torch.__version__)" 
+2.1.0

Python Modules

@@ -943,78 +690,63 @@

Python Modules

id="some-of-the-python-softwares-are-part-of-python-itself-or-of-other-softwares.-use-module-key">Some of the python softwares are part of Python itself, or of other softwares. Use “module key” -
module key toml
-The following modules match your search criteria: "toml"
-------------------------------------------------------------------------------------
-
-  Jupyter: Jupyter/2020.2.5-Python-3.8.5, Jupyter/2021.3.1-Python-3.8.5, Jupyter/2021.3.2-Python-3.8.5, Jupyter/2022.3.3, Jupyter/2022.3.4
-    Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.
-    
-
-  PyQuil: PyQuil/3.0.1
-    PyQuil is a library for generating and executing Quil programs on the Rigetti Forest platform.
-
-  Python: Python/3.8.5, Python/3.9.6, Python/3.10.4
-    Python is a programming language that lets you work more quickly and integrate your systems more effectively.
-
-------------------------------------------------------------------------------------
-
-
-

VSCode

-

Editing files on the -supercomputers

-

-
-
-

VSCode

-

-
-
-

VSCode

-
    -
  • You can have a terminal inside VSCode: -
      -
    • Go to the menu View->Terminal
    • -
  • -
+
module key toml
+The following modules match your search criteria: "toml"
+------------------------------------------------------------------------------------
+
+  Jupyter: Jupyter/2020.2.5-Python-3.8.5, Jupyter/2021.3.1-Python-3.8.5, Jupyter/2021.3.2-Python-3.8.5, Jupyter/2022.3.3, Jupyter/2022.3.4
+    Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.
+    
+
+  PyQuil: PyQuil/3.0.1
+    PyQuil is a library for generating and executing Quil programs on the Rigetti Forest platform.
+
+  Python: Python/3.8.5, Python/3.9.6, Python/3.10.4
+    Python is a programming language that lets you work more quickly and integrate your systems more effectively.
+
+------------------------------------------------------------------------------------
-
-

VSCode

-
    -
  • From the VSCode’s terminal, navigate to your -“course” folder and to the name you created earlier.

  • -
  • cd $HOME/course/
    -pwd
  • -
  • This is out working directory. We do everything -here.

  • -
+
+ +

How to run it on the login +node

+

create a python file

+

-

Demo code

-

Create a new -file “matrix.py” on VSCode on Jureca DC

-
code matrix.py
-

Paste this into the file:

-
import torch
-
-matrix1 = torch.randn(3,3)
-print("The first matrix is", matrix1)
-
-matrix2 = torch.randn(3,3)
-print("The second matrix is", matrix2)
-
-result = torch.matmul(matrix1,matrix2)
-print("The result is:\n", result)
+

create a python file

+

-

How to run it on the login -node

+

create a python file

+

+
+
+ +

create a python file

+
import torch
+
+matrix1 = torch.randn(3,3)
+print("The first matrix is", matrix1)
+
+matrix2 = torch.randn(3,3)
+print("The second matrix is", matrix2)
+
+result = torch.matmul(matrix1,matrix2)
+print("The result is:\n", result)
+
+
+ +

create a python file

+

+
+
+ +

Run code on the login node

module load Stages/2024
 module load GCC OpenMPI PyTorch
 python matrix.py
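For readers without the PyTorch modules loaded, what matrix.py computes can be sketched in pure Python (matmul here is a hypothetical stand-in for torch.matmul, not part of the course code):

```python
import random

def matmul(a, b):
    # Textbook triple-loop product of two matrices (lists of rows)
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

# Two random 3x3 matrices, like torch.randn(3, 3) in matrix.py
m1 = [[random.gauss(0, 1) for _ in range(3)] for _ in range(3)]
m2 = [[random.gauss(0, 1) for _ in range(3)] for _ in range(3)]
print("The result is:", matmul(m1, m2))
```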
@@ -1048,32 +780,34 @@

Slurm submission file

Slurm submission file example

-

code jureca-matrix.sbatch

-
#!/bin/bash
-#SBATCH --account=training2425           # Who pays?
-#SBATCH --nodes=1                        # How many compute nodes
-#SBATCH --job-name=matrix-multiplication
-#SBATCH --ntasks-per-node=1              # How many mpi processes/node
-#SBATCH --cpus-per-task=1                # How many cpus per mpi proc
-#SBATCH --output=output.%j        # Where to write results
-#SBATCH --error=error.%j
-#SBATCH --time=00:01:00          # For how long can it run?
-#SBATCH --partition=dc-gpu         # Machine partition
-#SBATCH --reservation=training2425 # For today only
-
-module load Stages/2024
-module load GCC OpenMPI PyTorch  # Load the correct modules on the compute node(s)
-
-srun python matrix.py            # srun tells the supercomputer how to run it
+

Create a file named jureca-matrix.sbatch as described in +the previous section, and copy the following content into it.

+
#!/bin/bash
+#SBATCH --account=training2441           # Who pays?
+#SBATCH --nodes=1                        # How many compute nodes
+#SBATCH --job-name=matrix-multiplication
+#SBATCH --ntasks-per-node=1              # How many mpi processes/node
+#SBATCH --cpus-per-task=1                # How many cpus per mpi proc
+#SBATCH --output=output.%j        # Where to write results
+#SBATCH --error=error.%j
+#SBATCH --time=00:01:00          # For how long can it run?
+#SBATCH --partition=dc-gpu         # Machine partition
+#SBATCH --reservation=training2441 # For today only
+
+module load Stages/2024
+module load GCC OpenMPI PyTorch  # Load the correct modules on the compute node(s)
+
+srun python matrix.py            # srun tells the supercomputer how to run it

Submitting a job: SBATCH

-
sbatch jureca-matrix.sbatch
-
-Submitted batch job 412169
+
sbatch jureca-matrix.sbatch
+
+Submitted batch job 412169
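Scripts often need the job id that sbatch prints; a minimal parsing sketch (job_id is a hypothetical helper, the message format is the one shown above):

```python
def job_id(sbatch_output: str) -> int:
    # sbatch prints e.g. "Submitted batch job 412169"
    return int(sbatch_output.strip().rsplit(" ", 1)[-1])

print(job_id("Submitted batch job 412169"))  # 412169
```

The same id is what squeue shows and what scancel expects.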
@@ -1084,10 +818,10 @@

Are we there yet?

Are we there yet? 🐴

squeue --me

-
squeue --me
-   JOBID  PARTITION    NAME      USER    ST       TIME  NODES NODELIST(REASON)
-   412169 gpus         matrix-m  strube1 CF       0:02      1 jsfc013
+
squeue --me
+   JOBID  PARTITION    NAME      USER    ST       TIME  NODES NODELIST(REASON)
+   412169 gpus         matrix-m  strube1 CF       0:02      1 jsfc013

ST is status:

  • PD (pending),
  • @@ -1104,14 +838,14 @@

    Reservations

  • Some partitions have reservations, which means that only certain users can use them at certain times.
  • For this course, it’s called -training2425
  • +training2441

Job is wrong, need to cancel

-
scancel <JOBID>
+
scancel <JOBID>
@@ -1120,11 +854,8 @@

Check logs

id="by-now-you-should-have-output-and-error-log-files-on-your-directory.-check-them">By now you should have output and error log files on your directory. Check them! -
# Notice that this number is the job id. It's different for every job
-cat output.412169 
-cat error.412169 
-

Or simply open it on VSCode!

+

simply open output.412169 and error.412169 +using an editor!

Extra software, modules and kernels

@@ -1133,9 +864,9 @@

You want that extra

Venv/Kernel template

-
cd $HOME/course/
-git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
+
cd $HOME/course/
+git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git

Example: Let’s install some software!

@@ -1154,11 +885,11 @@

Example: Let’s install
  • Edit the file sc_venv_template/requirements.txt

  • Add these lines at the end:

  • -
  • fastai
    -wandb
    -accelerate
    -deepspeed
  • +
  • fastai
    +wandb
    +accelerate
    +deepspeed
  • Run on the terminal: sc_venv_template/setup.sh

  • @@ -1168,27 +899,27 @@

    Example: Let’s install

    Example: Activating the virtual environment

      -
    • source sc_venv_template/activate.sh
    • +
    • source sc_venv_template/activate.sh

    Example: Activating the virtual environment

    -
    source ./activate.sh 
    -The activation script must be sourced, otherwise the virtual environment will not work.
    -Setting vars
    -The following modules were not unloaded:
    -  (Use "module --force purge" to unload all):
    - 1) Stages/2024
    -
    jureca01 $ python
    -Python 3.11.3 (main, Jun 25 2023, 13:17:30) [GCC 12.3.0]
    ->>> import fastai
    ->>> fastai.__version__
    -'2.7.14'
    +
    source ./activate.sh 
    +The activation script must be sourced, otherwise the virtual environment will not work.
    +Setting vars
    +The following modules were not unloaded:
    +  (Use "module --force purge" to unload all):
    + 1) Stages/2024
    +
    jureca01 $ python
    +Python 3.11.3 (main, Jun 25 2023, 13:17:30) [GCC 12.3.0]
    +>>> import fastai
    +>>> fastai.__version__
    +'2.7.14'
    @@ -1196,60 +927,60 @@

    Let’s train a 🐈 classifier!

    • This is a minimal demo, to show some quirks of the supercomputer

    code cats.py

    from fastai.vision.all import *
    from fastai.callback.tensorboard import *
    #
    print("Downloading dataset...")
    path = untar_data(URLs.PETS)/'images'
    print("Finished downloading dataset")
    #
    def is_cat(x): return x[0].isupper()
    # Create the dataloaders and resize the images
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(224))
    print("On the login node, this will download resnet34")
    learn = vision_learner(dls, resnet34, metrics=accuracy)
    cbs=[SaveModelCallback(), TensorBoardCallback('runs', trace_model=True)]
    # Trains the model for 6 epochs with this dataset
    learn.unfreeze()
    learn.fit_one_cycle(6, cbs=cbs)
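The labeling rule above relies on a quirk of the Oxford-IIIT Pets filenames: cat breeds are capitalized, dog breeds are lowercase. A minimal pure-Python sketch of that rule (the filenames below are illustrative, but they follow the dataset's real naming convention):

```python
# The Pets dataset encodes the class in the filename's case:
# cat breeds start with an uppercase letter, dog breeds with lowercase.
def is_cat(filename: str) -> bool:
    return filename[0].isupper()

print(is_cat("Bengal_101.jpg"))  # cat breed  -> True
print(is_cat("pug_52.jpg"))      # dog breed  -> False
```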

    Submission file for the classifier

    code fastai.sbatch

    #!/bin/bash
    #SBATCH --account=training2441
    #SBATCH --mail-user=MYUSER@fz-juelich.de
    #SBATCH --mail-type=ALL
    #SBATCH --nodes=1
    #SBATCH --job-name=cat-classifier
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=128
    #SBATCH --output=output.%j
    #SBATCH --error=error.%j
    #SBATCH --time=00:20:00
    #SBATCH --partition=dc-gpu
    #SBATCH --reservation=training2441 # For today only

    cd $HOME/course/
    source sc_venv_template/activate.sh # Now we finally use the fastai module

    srun python cats.py

    Submit it

    sbatch fastai.sbatch

    Submission time

    Probably not much happening…

    • $ cat output.7948496
      The activation script must be sourced, otherwise the virtual environment will not work.
      Setting vars
      Downloading dataset...

    • $ cat error.7948496
      The following modules were not unloaded:
      (Use "module --force purge" to unload all):

      1) Stages/2024

    What happened?

  • Check the error.${JOBID} file
  • If you run it longer, you will get the actual error:
  • Traceback (most recent call last):
      File "/p/project/training2441/strube1/cats.py", line 5, in <module>
        path = untar_data(URLs.PETS)/'images'
        ...
        ...
        raise URLError(err)
    urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
    srun: error: jwb0160: task 0: Exited with exit code 1

    🤔…

    What is it doing?

    • This downloads the dataset:

      path = untar_data(URLs.PETS)/'images'

    • And this one downloads the pre-trained weights:

      learn = vision_learner(dls, resnet34, metrics=error_rate)
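Running the downloads on the login node works because login and compute nodes share the same filesystem: fast.ai and PyTorch cache what they fetch, so the compute node just reuses the cached copy. The pattern is simply "download only if not cached"; a hedged stdlib sketch (the path and fake download are illustrative, not the library's actual cache logic):

```python
import os
import tempfile
from pathlib import Path

def ensure_cached(path: str, download) -> str:
    """Download a file only when it is not already on the shared filesystem."""
    if not os.path.exists(path):
        download(path)   # needs internet: run this on the login node
    return path          # compute nodes just reuse the cached copy

# Illustrative: a fake "download" into a temporary directory
cache = os.path.join(tempfile.mkdtemp(), "resnet34.pth")
ensure_cached(cache, lambda p: Path(p).write_text("weights"))

# A second call finds the cache and never tries to download again:
def fail(p):
    raise RuntimeError("no internet on compute nodes")
ensure_cached(cache, fail)
print("cached at", cache)
```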

    Compute nodes have no internet connection

    On the login node:

    • Comment out the line which does the AI training:

      # learn.fit_one_cycle(6, cbs=cbs)

    • Call our code on the login node!

      source sc_venv_template/activate.sh # So that we have the fast.ai library
      python cats.py

    Run the downloader on the login node

    $ source sc_venv_template/activate.sh
    $ python cats.py
    Downloading dataset...
     |████████-------------------------------| 23.50% [190750720/811706944 00:08<00:26]
     Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /p/project/ccstao/cstao05/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
    100%|█████████████████████████████████████| 83.3M/83.3M [00:00<00:00, 266MB/s]

    Run it again on the compute nodes!

    • Un-comment the line that does the training:

      learn.fit_one_cycle(6, cbs=cbs)

    • Submit the job!

      sbatch fastai.sbatch

    Masochistically waiting for the job to run?

    watch squeue --me

    (To exit, type CTRL-C)

    Check output files

    • You can see them within VSCode

      The activation script must be sourced, otherwise the virtual environment will not work.
      Setting vars
      Downloading dataset...
      Finished downloading dataset
      epoch     train_loss  valid_loss  error_rate  time
      Epoch 1/1 : |-----------------------------------| 0.00% [0/92 00:00<?]
      Epoch 1/1 : |-----------------------------------| 2.17% [2/92 00:14<10:35 1.7452]
      Epoch 1/1 : |█----------------------------------| 3.26% [3/92 00:14<07:01 1.6413]
      Epoch 1/1 : |██---------------------------------| 5.43% [5/92 00:15<04:36 1.6057]
      ...
      Epoch 1/1 :
      epoch     train_loss  valid_loss  error_rate  time
      0         0.049855    0.021369    0.007442    00:42
    • 🎉
    • 🥳

    Tools for results analysis

    Tensorboard
  • And we already have the code for it in our example!

    cbs=[SaveModelCallback(), TensorBoardCallback('runs', trace_model=True)]

    Example: Tensorboard

    • The command

      tensorboard --logdir=runs  --port=9999 serve

    • Opens a connection on port 9999… OF THE SUPERCOMPUTER.
    • This port is behind the firewall. You can’t access it directly.

      Port Forwarding

      supercomputer’s port 3000 as port 1234 locally

    Port forwarding demo:

    • On VSCode’s terminal:

      cd $HOME/course/
      source sc_venv_template/activate.sh
      tensorboard --logdir=runs  --port=12345 serve

    • Note the tab PORTS next to the terminal

    • On the browser: http://localhost:12345

    Tensorboard on Jureca DC


    Day 1 recap

    ANY QUESTIONS??

    Feedback is more than welcome!


    Helmholtz Blablador

    Blablador

    • Blablador is our Large Language Model inference server (eg. ChatGPT)

    • It’s a service for the Helmholtz Association.

      • It’s fast, free and PRIVATE - I don’t record your conversations!

    • Anyone here can use it

    Blablador

    https://helmholtz-blablador.fz-juelich.de

    VScode + Continue.dev


    Obtaining a token

    • Go to the Helmholtz Codebase at http://codebase.helmholtz.cloud

    • Log in with your email

    • On the left side, click on your profile, and then on “Preferences”

    • On “Access tokens”, click “Add new token”:

      • give it a name,
      • put an expiration date (max 1 year),
      • and choose “api” in the “scopes” section

    • Click “Create Personal Access Token”

      • You will see a “………………………..” - copy this and save it somewhere.

    Blablador


    Blablador on VSCode!

    • Add the continue.dev extension to VSCode

    • On Continue, choose to add a model, then choose Other OpenAI-compatible API

    • Click on Open Config.json at the end

    Blablador: VScode + Continue.dev

    • Inside config.json, add at the "models" section:

        {
          "title": "Mistral helmholtz",
          "provider": "openai",
          "contextLength": 16384,
          "model": "alias-code",
          "apiKey": "ADD-YOUR-TOKEN-HERE",
          "apiBase": "https://helmholtz-blablador.fz-juelich.de:8000"
        },

    • REPLACE THE APIKEY WITH YOUR OWN TOKEN!!!!

    Blablador on VSCode

    • Click on the Continue.dev extension on the left side of VSCode.

    • Select some code from our exercises and send it to Continue with cmd-shift-L (or ctrl-shift-L)

    • Ask it to add unit tests, for example.

    Backup slides

    There’s more!

    • Remember the magic? 🧙‍♂️

    • Let’s use it now to access the compute nodes directly!

    Proxy Jump

    Accessing compute nodes directly

    • If we need to access some ports on the compute nodes

    Proxy Jump - SSH Configuration

    Type on your machine “code $HOME/.ssh/config” and paste this at the end:

    # -- Compute Nodes --
    Host *.jureca
            User [ADD YOUR USERNAME HERE]
            StrictHostKeyChecking no
            IdentityFile ~/.ssh/id_ed25519-JSC
            ProxyJump jureca

    Proxy Jump: Connecting to a node

    Example: A service provides a web interface on port 9999

    On the supercomputer:

    srun --time=00:05:00 \
         --nodes=1 --ntasks=1 \
         --partition=dc-gpu \
         --account training2441 \
         --cpu_bind=none \
         --pty /bin/bash -i

    bash-4.4$ hostname # This is running on a compute node of the supercomputer
    jwb0002

    bash-4.4$ cd $HOME/course/
    bash-4.4$ source sc_venv_template/activate.sh
    bash-4.4$ tensorboard --logdir=runs  --port=9999 serve

    Proxy Jump

    On your machine:

    • ssh -L :3334:localhost:9999 jrc002i.jureca

    • Mind the i letter I added at the end of the hostname

    • Now you can access the service on your local browser at http://localhost:3334

    Now that’s really the end! 😓

diff --git a/public/03-parallelize-training.html b/public/02-parallelize-training.html

    Bringing Deep Learning Workloads to JSC supercomputers

    Parallelize Training

    Alexandre Strube // Sabrina Benassou // Javad Kasravi

    November 19, 2024

    Good practice

    • Always store your code in the project folder. In our case:

      /p/project/training2441/$USER

    • Store data in the scratch directory for faster I/O access. Files in scratch are deleted after 90 days of inactivity.

      /p/scratch/training2441/$USER

    • Store the data in $DATA_dataset for a more permanent location. This location is not accessible by compute nodes. You have to join the project in order to store and access data.
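As a convenience, the locations above can be derived from the user name; a minimal sketch (the helper function is illustrative, with the course project name hard-coded):

```python
import os

PROJECT = "training2441"  # course project name

def course_paths(user: str) -> dict:
    """Return the storage locations recommended above for a given user."""
    return {
        "code":    f"/p/project/{PROJECT}/{user}",  # long-term, for code
        "scratch": f"/p/scratch/{PROJECT}/{user}",  # fast I/O, purged after 90 days
    }

paths = course_paths(os.getenv("USER", "demo"))
print(paths["code"])
```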

    We need to download some code

    cd $HOME/course
    git clone https://github.com/HelmholtzAI-FZJ/2024-11-course-deep-learning-in-neuroscience

    The ResNet50 Model


    The ImageNet dataset

    Large Scale Visual Recognition Challenge (ILSVRC)

    • An image dataset organized according to the WordNet hierarchy.

    • Extensively used in algorithms for object detection and image classification at large scale.

    • It has 1000 classes, comprising 1.2 million training images and 50,000 validation images.

    ImageNet class

    import os
    import pickle

    from PIL import Image
    from torch.utils.data import Dataset

    class ImageNet(Dataset):
        def __init__(self, root, split, transform=None):
            if split not in ["train", "val"]:
                raise ValueError("split must be either 'train' or 'val'")

            self.root = root

            with open(os.path.join(root, "imagenet_{}.pk".format(split)), "rb") as f:
                data = pickle.load(f)

            self.samples = list(data.keys())
            self.targets = list(data.values())
            self.transform = transform

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            x = Image.open(os.path.join(self.root, self.samples[idx])).convert("RGB")
            if self.transform:
                x = self.transform(x)
            return x, self.targets[idx]

    PyTorch Lightning Data Module

    class ImageNetDataModule(pl.LightningDataModule):
        def __init__(
            self,
            data_root: str,
            batch_size: int,
            num_workers: int,
            dataset_transforms: dict,
        ):
            super().__init__()
            self.data_root = data_root
            self.batch_size = batch_size
            self.num_workers = num_workers
            self.dataset_transforms = dataset_transforms

        def setup(self, stage: Optional[str] = None):
            self.train = ImageNet(self.data_root, "train", self.dataset_transforms)

        def train_dataloader(self):
            return DataLoader(self.train, batch_size=self.batch_size, \
                num_workers=self.num_workers)

    PyTorch Lightning Module

    class resnet50Model(pl.LightningModule):
        def __init__(self):
            super().__init__()
            weights = ResNet50_Weights.DEFAULT
            self.model = resnet50(weights=weights)

        def forward(self, x):
            return self.model(x)

        def training_step(self, batch):
            x, labels = batch
            pred = self.forward(x)
            train_loss = F.cross_entropy(pred, labels)
            self.log("training_loss", train_loss)

            return train_loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=0.02)
    One GPU training

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((256, 256))
    ])

    # 1. Organize the data
    datamodule = ImageNetDataModule("/p/scratch/training2441/", 256, \
        int(os.getenv('SLURM_CPUS_PER_TASK')), transform)
    # 2. Build the model using desired Task
    model = resnet50Model()
    # 3. Create the trainer
    trainer = pl.Trainer(max_epochs=10,  accelerator="gpu")
    # 4. Train the model
    trainer.fit(model, datamodule=datamodule)
    # 5. Save the model!
    trainer.save_checkpoint("image_classification_model.pt")

    One GPU training

    #!/bin/bash -x
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=128
    #SBATCH --time=06:00:00
    #SBATCH --partition=dc-gpu
    #SBATCH --account=training2441
    #SBATCH --output=%j.out
    #SBATCH --error=%j.err
    #SBATCH --reservation=training2441

    # To get number of cpu per task
    export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"
    # activate env
    source $HOME/course/$USER/sc_venv_template/activate.sh
    # run script from above
    time srun python3 gpu_training.py

    real    342m11.864s

    DEMO

    But what about many GPUs?

    • We make use of the GPUs of our supercomputer and distribute our training to make it faster.
    • It’s when things get interesting

    Distributed Training

    • Parallelizes the training across multiple nodes, significantly enhancing training speed and model accuracy.

    • It is particularly beneficial for large models and computationally intensive tasks, such as deep learning.[1]

    Parallel Training with PyTorch DDP

    • PyTorch’s DDP (Distributed Data Parallel) works as follows:

      • Each GPU across each node gets its own process.

      • Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.

      • Each process inits the model.

      • Each process performs a full forward and backward pass in parallel.

      • The gradients are synced and averaged across all processes.

      • Each process updates its optimizer.
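The "each GPU sees only a subset" point is what a distributed sampler provides under the hood; a pure-Python sketch of round-robin index partitioning (the real logic, including shuffling and padding, lives in torch.utils.data.distributed.DistributedSampler):

```python
def partition_indices(num_samples: int, world_size: int, rank: int) -> list:
    """Give each rank a disjoint, round-robin slice of the dataset indices."""
    return list(range(rank, num_samples, world_size))

# With 8 samples and 4 processes, each rank sees 2 distinct samples;
# together the slices are disjoint and cover every index exactly once.
for rank in range(4):
    print(rank, partition_indices(8, 4, rank))
```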

    Data Parallel


    Data Parallel - Averaging
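The averaging step sketched here is an all-reduce: every process contributes its local gradient and receives the mean. A toy simulation of just the arithmetic, with plain lists instead of real inter-process communication:

```python
def allreduce_mean(local_grads):
    """Average per-parameter gradients across simulated processes."""
    world_size = len(local_grads)
    return [sum(g[i] for g in local_grads) / world_size
            for i in range(len(local_grads[0]))]

# Two processes computed different gradients on their data shards:
grads = allreduce_mean([[1.0, 2.0], [3.0, 4.0]])
print(grads)  # [2.0, 3.0] - every process continues with the same averaged gradient
```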

    Multi-GPU training

    1 node and 4 GPU

    #!/bin/bash -x
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:4                  # Use the 4 GPUs available
    #SBATCH --ntasks-per-node=4           # When using pl it should always be set to 4
    #SBATCH --cpus-per-task=32            # Divide the number of cpus (128) by the number of GPUs (4)
    #SBATCH --time=02:00:00
    #SBATCH --partition=dc-gpu
    #SBATCH --account=training2441
    #SBATCH --output=%j.out
    #SBATCH --error=%j.err
    #SBATCH --reservation=training2441

    export CUDA_VISIBLE_DEVICES=0,1,2,3    # Very important to make the GPUs visible
    export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"

    source $HOME/course/$USER/sc_venv_template/activate.sh
    time srun python3 gpu_training.py

    real    89m15.923s

    DEMO


    Data Parallel - Multi Node

    Data Parallel - Multi Node


    DDP steps

    1. Set up the environment variables for the distributed mode (WORLD_SIZE, RANK, LOCAL_RANK, …)

      # The number of total processes started by Slurm.
      ntasks = os.getenv('SLURM_NTASKS')
      # Index of the current process.
      rank = os.getenv('SLURM_PROCID')
      # Index of the current process on this node only.
      local_rank = os.getenv('SLURM_LOCALID')
      # The number of nodes
      nnodes = os.getenv("SLURM_NNODES")
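Note that os.getenv returns strings (or None outside of a Slurm job), so these values are typically converted to integers with single-process defaults; a minimal sketch (the helper name is illustrative):

```python
import os

def slurm_env(name: str, default: int) -> int:
    """Read a Slurm environment variable as an int, with a fallback for local runs."""
    value = os.getenv(name)
    return int(value) if value is not None else default

world_size = slurm_env("SLURM_NTASKS", 1)   # total processes
rank       = slurm_env("SLURM_PROCID", 0)   # global index of this process
local_rank = slurm_env("SLURM_LOCALID", 0)  # index on this node
print(world_size, rank, local_rank)
```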

    DDP steps

    2. Initialize a sampler to specify the sequence of indices/keys used in data loading.

    3. Implement data parallelism of the model.

    4. Allow only one process to save checkpoints.

      datamodule = ImageNetDataModule("/p/scratch/training2441/", 256, \
          int(os.getenv('SLURM_CPUS_PER_TASK')), transform)
      trainer = pl.Trainer(max_epochs=10,  accelerator="gpu", num_nodes=nnodes)
      trainer.fit(model, datamodule=datamodule)
      trainer.save_checkpoint("image_classification_model.pt")
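"Only one process saves checkpoints" is usually enforced by guarding the save with a rank check (Lightning does this internally through its rank-zero utilities); the underlying idea in plain Python:

```python
def save_on_rank_zero(rank: int, save_fn) -> bool:
    """Run save_fn only on the process with global rank 0; report whether it ran."""
    if rank == 0:
        save_fn()
        return True
    return False

saved = []
# Simulate 4 processes all reaching the checkpoint step:
for rank in range(4):
    save_on_rank_zero(rank, lambda: saved.append("checkpoint.pt"))
print(saved)  # only one checkpoint is written
```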

    Multi-Node training

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((256, 256))
    ])

    # 1. The number of nodes
    nnodes = int(os.getenv("SLURM_NNODES"))
    # 2. Organize the data
    datamodule = ImageNetDataModule("/p/scratch/training2441/", 128, \
        int(os.getenv('SLURM_CPUS_PER_TASK')), transform)
    # 3. Build the model using desired Task
    model = resnet50Model()
    # 4. Create the trainer
    trainer = pl.Trainer(max_epochs=10,  accelerator="gpu", num_nodes=nnodes)
    # 5. Train the model
    trainer.fit(model, datamodule=datamodule)
    # 6. Save the model!
    trainer.save_checkpoint("image_classification_model.pt")

    Multi-Node training

    16 nodes and 4 GPUs each

    #!/bin/bash -x
    #SBATCH --nodes=16                     # This needs to match Trainer(num_nodes=...)
    #SBATCH --gres=gpu:4                   # Use the 4 GPUs available
    #SBATCH --ntasks-per-node=4            # When using pl it should always be set to 4
    #SBATCH --cpus-per-task=32             # Divide the number of cpus (128) by the number of GPUs (4)
    #SBATCH --time=00:15:00
    #SBATCH --partition=dc-gpu
    #SBATCH --account=training2441
    #SBATCH --output=%j.out
    #SBATCH --error=%j.err
    #SBATCH --reservation=training2441

    export CUDA_VISIBLE_DEVICES=0,1,2,3    # Very important to make the GPUs visible
    export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"

    source $HOME/course/$USER/sc_venv_template/activate.sh
    time srun python3 ddp_training.py

    real    6m56.457s

    Multi-Node training

    With 4 nodes:

    real    24m48.169s

    With 8 nodes:

    real    13m10.722s

    With 16 nodes:

    real    6m56.457s

    With 32 nodes:

    real    4m48.313s
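These wall-clock times translate into a near-linear speedup that flattens as communication overhead grows; a quick calculation against the single-GPU baseline (342m11.864s) using the timings above:

```python
# Wall-clock times from the runs above, in seconds
baseline = 342 * 60 + 11.864          # 1 GPU
runs = {
    4:  24 * 60 + 48.169,             # 4 nodes  = 16 GPUs
    8:  13 * 60 + 10.722,             # 8 nodes  = 32 GPUs
    16:  6 * 60 + 56.457,             # 16 nodes = 64 GPUs
    32:  4 * 60 + 48.313,             # 32 nodes = 128 GPUs
}

for nodes, seconds in runs.items():
    speedup = baseline / seconds
    efficiency = speedup / (nodes * 4)  # 4 GPUs per node
    print(f"{nodes:2d} nodes: speedup {speedup:5.1f}x, efficiency {efficiency:.0%}")
```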

    Data Parallel

    • It was

      trainer = pl.Trainer(max_epochs=10,  accelerator="gpu")

    • Became

      nnodes = os.getenv("SLURM_NNODES")
      trainer = pl.Trainer(max_epochs=10,  accelerator="gpu", num_nodes=nnodes)

    Data Parallel

    • It was

      #SBATCH --nodes=1
      #SBATCH --gres=gpu:1
      #SBATCH --ntasks-per-node=1
      #SBATCH --cpus-per-task=128

    • Became

      #SBATCH --nodes=16                   # This needs to match Trainer(num_nodes=...)
      #SBATCH --gres=gpu:4                 # Use the 4 GPUs available
      #SBATCH --ntasks-per-node=4          # When using pl it should always be set to 4
      #SBATCH --cpus-per-task=32           # Divide the number of cpus (128) by the number of GPUs (4)
      export CUDA_VISIBLE_DEVICES=0,1,2,3  # Very important to make the GPUs visible

    DEMO


    Before we go further…


      Recap


---

### Terminologies

- WORLD_SIZE: number of processes participating in the job.
- RANK: the rank of the process in the network.
- LOCAL_RANK: the rank of the process on the local machine.
- MASTER_PORT: free port on the machine with rank 0.
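Slurm hands these values to every process through environment variables. A quick sketch of how they relate (Slurm sets `SLURM_PROCID` directly as the global rank; the arithmetic below just shows the relationship, and all values are made up for a 2-node, 4-GPU-per-node job):

```python
import os

# Hypothetical values, as Slurm would set them for one particular process:
os.environ["SLURM_NNODES"] = "2"
os.environ["SLURM_NTASKS"] = "8"     # -> WORLD_SIZE
os.environ["SLURM_NODEID"] = "1"     # index of the node this process runs on
os.environ["SLURM_LOCALID"] = "2"    # -> LOCAL_RANK

world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])
tasks_per_node = world_size // int(os.environ["SLURM_NNODES"])

# Global RANK: all ranks on earlier nodes, plus the local rank here.
rank = int(os.environ["SLURM_NODEID"]) * tasks_per_node + local_rank

print(world_size, rank, local_rank)  # 8 6 2
```

Note that the variables arrive as strings and must be converted with `int()` before any arithmetic.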

---

### DDP steps

1. Set up the environment variables for the distributed mode (WORLD_SIZE, RANK, LOCAL_RANK…)

```python
# The number of total processes started by Slurm.
ntasks = os.getenv('SLURM_NTASKS')
# Index of the current process.
rank = os.getenv('SLURM_PROCID')
# Index of the current process on this node only.
local_rank = os.getenv('SLURM_LOCALID')
# The number of nodes (as an int, for Trainer(num_nodes=...))
nnodes = int(os.getenv("SLURM_NNODES"))
```

---

### DDP steps

2. Initialize a sampler to specify the sequence of indices/keys used in data loading.
3. Implement data parallelism of the model.
4. Allow only one process to save checkpoints.

```python
datamodule = ImageNetDataModule("/p/scratch/training2425/data/", 256, \
    int(os.getenv('SLURM_CPUS_PER_TASK')), transform)
trainer = pl.Trainer(max_epochs=10, accelerator="gpu", num_nodes=nnodes)
trainer.fit(model, datamodule=datamodule)
trainer.save_checkpoint("image_classification_model.pt")
```

---

### DDP steps

```python
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((256, 256))
])

# 1. The number of nodes
nnodes = int(os.getenv("SLURM_NNODES"))
# 2. Organize the data
datamodule = ImageNetDataModule("/p/scratch/training2425/data/", 128, \
    int(os.getenv('SLURM_CPUS_PER_TASK')), transform)
# 3. Build the model using desired Task
model = resnet50Model()
# 4. Create the trainer
trainer = pl.Trainer(max_epochs=10, accelerator="gpu", num_nodes=nnodes)
# 5. Train the model
trainer.fit(model, datamodule=datamodule)
# 6. Save the model!
trainer.save_checkpoint("image_classification_model.pt")
```

---

### DDP training

16 nodes and 4 GPUs each:

```bash
#!/bin/bash -x
#SBATCH --nodes=16                     # This needs to match Trainer(num_nodes=...)
#SBATCH --gres=gpu:4                   # Use the 4 GPUs available
#SBATCH --ntasks-per-node=4            # With Lightning, one task per GPU, so 4
#SBATCH --cpus-per-task=24             # Divide the number of cpus (96) by the number of GPUs (4)
#SBATCH --time=00:15:00
#SBATCH --partition=dc-gpu
#SBATCH --account=training2425
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --reservation=training2425

export CUDA_VISIBLE_DEVICES=0,1,2,3    # Very important to make the GPUs visible
export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"

source $HOME/course/$USER/sc_venv_template/activate.sh
time srun python3 ddp_training.py
```

    real    6m56.457s

---

### DDP training

With 4 nodes:

    real    24m48.169s

With 8 nodes:

    real    13m10.722s

With 16 nodes:

    real    6m56.457s

With 32 nodes:

    real    4m48.313s
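From those wall-clock times we can estimate how well the run scales. A small helper (times copied from the runs above) computes speedup and parallel efficiency relative to the 4-node run; note that each doubling of nodes buys less than a 2x improvement:

```python
def to_seconds(minutes, seconds):
    """Convert a `real Xm Ys` timing to seconds."""
    return minutes * 60 + seconds

# real times reported above: nodes -> seconds
runs = {
    4:  to_seconds(24, 48.169),
    8:  to_seconds(13, 10.722),
    16: to_seconds(6, 56.457),
    32: to_seconds(4, 48.313),
}

base_nodes = 4
for nodes, t in runs.items():
    speedup = runs[base_nodes] / t
    efficiency = speedup / (nodes / base_nodes)  # 100% would be ideal scaling
    print(f"{nodes:2d} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Data loading, communication, and per-epoch fixed costs all eat into efficiency as the node count grows.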

---

### Data Parallel

It was:

```python
trainer = pl.Trainer(max_epochs=10, accelerator="gpu")
```

Became:

```python
nnodes = int(os.getenv("SLURM_NNODES"))  # num_nodes expects an int, not the string from the environment
trainer = pl.Trainer(max_epochs=10, accelerator="gpu", num_nodes=nnodes)
```

---

### Data Parallel

It was:

```bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=96
```

Became:

```bash
#SBATCH --nodes=16                   # This needs to match Trainer(num_nodes=...)
#SBATCH --gres=gpu:4                 # Use the 4 GPUs available
#SBATCH --ntasks-per-node=4          # With Lightning, one task per GPU, so 4
#SBATCH --cpus-per-task=24           # Divide the number of cpus (96) by the number of GPUs (4)
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Very important to make the GPUs visible
```

---

### DEMO

---

### TensorBoard

In resnet50.py:

```python
self.log("training_loss", train_loss)
```

---

### TensorBoard

```bash
source $HOME/course/$USER/sc_venv_template/activate.sh
tensorboard --logdir=[PATH_TO_TENSOR_BOARD]
```


    ANY QUESTIONS??

    Feedback is more than welcome!

---

### Bringing Deep Learning Workloads to JSC supercomputers

#### Data loading

Alexandre Strube // Sabrina Benassou

June 25, 2024

---

### Schedule for day 2

| Time | Title |
| ------------- | ----------- |
| 10:00 - 10:15 | Welcome, questions |
| 10:15 - 11:30 | Data loading |
| 11:30 - 12:00 | Coffee Break (flexible) |
| 12:30 - 14:00 | Parallelize Training |

---

### Let’s talk about DATA

- Some general considerations one should have in mind

*(image: Not this data)*

---

### I/O is separate and shared

#### All compute nodes of all supercomputers see the same files

- Performance tradeoff between shared accessibility and speed
- It’s simple to load data fast to 1 or 2 GPUs. But to 100? 1000? 10000?

---

### Jülich Supercomputers

- Our I/O server is almost a supercomputer by itself

*(image: JSC Supercomputer Strategy)*

---

### Where do I keep my files?

- `$PROJECT_projectname` for code (projectname is `training2425` in this case)
  - Most of your work should stay here
- `$DATA_projectname` for big data(*)
  - Permanent location for big datasets
- `$SCRATCH_projectname` for temporary files (fast, but not permanent)
  - Files are deleted after 90 days untouched

---

### Data services

- JSC provides different data services
- Data projects give massive amounts of storage
- We use it for ML datasets. Join the project at Judoor
- After being approved, connect to the supercomputer and try it:

```bash
cd $DATA_datasets
ls -la
```

---

### Data Staging

- The LARGEDATA filesystem is not accessible by compute nodes
  - Copy files to an accessible filesystem BEFORE working
- Copying Imagenet-21K to `$SCRATCH` alone takes 21+ minutes
  - We already copied it to `$SCRATCH` for you
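Staging boils down to a recursive copy to the fast filesystem before training starts. A minimal stdlib sketch (the `$DATA`/`$SCRATCH` stand-ins here are temporary directories, so it runs anywhere):

```python
import shutil
import tempfile
from pathlib import Path

# Stand-ins for $DATA_... (permanent, not compute-visible) and
# $SCRATCH_... (fast, temporary, compute-visible).
data_root = Path(tempfile.mkdtemp(prefix="data_"))
scratch_root = Path(tempfile.mkdtemp(prefix="scratch_"))

# Pretend the permanent storage holds a small dataset.
(data_root / "dataset").mkdir()
(data_root / "dataset" / "sample0.txt").write_text("pixel data")

# Stage: copy the dataset to the fast filesystem BEFORE training.
staged = shutil.copytree(data_root / "dataset", scratch_root / "dataset")

print(sorted(p.name for p in Path(staged).iterdir()))  # ['sample0.txt']
```

On the real machines you would do the equivalent with `cp -r` or `rsync` in the job script, from `$DATA_projectname` to `$SCRATCH_projectname`.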

---

### Data loading

*(image: Fat GPUs need to be fed FAST)*

---

### Strategies

- We have CPUs and lots of memory - let’s use them
  - Multitask: train while loading the next batch in the background
  - `/dev/shm` is a filesystem in RAM - ultra fast ⚡️
- Use big files made for parallel computing
  - HDF5, Zarr, mmap() in a parallel fs, LMDB
- Use specialized data loading libraries
  - FFCV, DALI, Apache Arrow
- Compression such as squashfs
  - Data transfer can be slower than decompression (must be checked case by case)
  - Beneficial in cases where numerous small files are at hand
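The first strategy - overlapping training with loading of the next batch - can be sketched with a background thread and a bounded queue. This is a plain-Python stand-in for what PyTorch's DataLoader workers do (all names below are made up for illustration):

```python
import queue
import threading

def batches(n):
    """Pretend each batch takes I/O time to load."""
    for i in range(n):
        yield [i] * 4  # a fake batch of 4 samples

def prefetch(iterable, depth=2):
    """Load items in a background thread while the consumer trains."""
    q = queue.Queue(maxsize=depth)  # bounded: loader stays `depth` batches ahead
    done = object()

    def producer():
        for item in iterable:
            q.put(item)
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not done:
        yield item

total = 0
for batch in prefetch(batches(3)):
    total += sum(batch)   # the "training" step overlaps with loading
print(total)  # 0*4 + 1*4 + 2*4 = 12
```

The bounded queue is the key design choice: it keeps the loader ahead of the GPU without letting it fill memory with unconsumed batches.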

---

### Libraries

---

### We need to download some code

```bash
cd $HOME/course
git clone https://github.com/HelmholtzAI-FZJ/2024-06-course-Bringing-Deep-Learning-Workloads-to-JSC-supercomputers.git
```

---

### The ImageNet dataset

#### Large Scale Visual Recognition Challenge (ILSVRC)

- An image dataset organized according to the WordNet hierarchy.
- Extensively used in algorithms for object detection and image classification at large scale.
- It has 1000 classes, comprising 1.2 million images for training and 50,000 images for the validation set.

---

### The ImageNet dataset

```
ILSVRC
|-- Data/
    `-- CLS-LOC
        |-- test
        |-- train
        |   |-- n01440764
        |   |   |-- n01440764_10026.JPEG
        |   |   |-- n01440764_10027.JPEG
        |   |   |-- n01440764_10029.JPEG
        |   |-- n01695060
        |   |   |-- n01695060_10009.JPEG
        |   |   |-- n01695060_10022.JPEG
        |   |   |-- n01695060_10028.JPEG
        |   |   |-- ...
        |   |...
        |-- val
            |-- ILSVRC2012_val_00000001.JPEG
            |-- ILSVRC2012_val_00016668.JPEG
            |-- ILSVRC2012_val_00033335.JPEG
            |-- ...
```
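The directory layout doubles as the label source: each `n…` folder under `train/` is one class. A stdlib sketch that builds a `(path, label)` list from such a tree (a tiny fake tree in a temporary directory, so it runs anywhere):

```python
import tempfile
from pathlib import Path

# Build a tiny fake ILSVRC-style tree: train/<wnid>/<image>.JPEG
root = Path(tempfile.mkdtemp()) / "ILSVRC" / "Data" / "CLS-LOC" / "train"
for wnid, n_files in {"n01440764": 2, "n01695060": 1}.items():
    (root / wnid).mkdir(parents=True)
    for i in range(n_files):
        (root / wnid / f"{wnid}_{i}.JPEG").touch()

# Map each class folder to an integer label, then list (path, label) pairs.
classes = sorted(p.name for p in root.iterdir())
class_to_idx = {name: i for i, name in enumerate(classes)}
samples = [
    (str(path), class_to_idx[path.parent.name])
    for path in sorted(root.rglob("*.JPEG"))
]
print(len(samples), class_to_idx)  # 3 {'n01440764': 0, 'n01695060': 1}
```

This scan is exactly what gets expensive on a parallel filesystem with 1.2 million files, which motivates the precomputed JSON index on the next slide.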

---

### The ImageNet dataset

imagenet_train.json

```python
{
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_8050.JPEG': 524,
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_12728.JPEG': 524,
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_9736.JPEG': 524,
    ...
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_7460.JPEG': 524,
    ...
}
```

imagenet_val.json

```python
{
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00008838.JPEG': 785,
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00008555.JPEG': 129,
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00028410.JPEG': 968,
    ...
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00016007.JPEG': 709,
}
```
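Such a mapping is trivial to turn into the index-aligned `samples`/`targets` pair a dataset class needs (the dict below is a small made-up subset of the real file):

```python
import json

# A made-up two-entry subset of imagenet_train.json: relative path -> class index.
raw = json.loads("""{
  "ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_8050.JPEG": 524,
  "ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_12728.JPEG": 524
}""")

# One flat list of paths and one of integer labels, index-aligned.
samples = list(raw.keys())
targets = list(raw.values())

print(len(samples), targets)  # 2 [524, 524]
```

Loading one JSON file touches a single inode instead of walking millions of directory entries.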

---

### Access File System

```python
def __getitem__(self, idx):
    x = Image.open(os.path.join(self.root, self.samples[idx])).convert("RGB")
    if self.transform:
        x = self.transform(x)
    return x, self.targets[idx]
```

---

### Inodes

- Inodes (Index Nodes) are data structures that store metadata about files and directories.
- Unique identification of files and directories within the file system.
- Efficient management and retrieval of file metadata.
- Essential for file operations like opening, reading, and writing.
- Limitations:
  - Fixed Number: Limited number of inodes; no new files if exhausted, even with free disk space.
  - Space Consumption: Inodes consume disk space; balancing is needed for efficiency.
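You can see a file's inode number from Python via `os.stat`. A quick check on a temporary file - the point being that every file costs one inode, so millions of tiny images cost millions of inodes on the shared filesystem:

```python
import os
import tempfile

# One file -> one inode. A dataset of millions of tiny files
# consumes millions of inodes, which is what strains a parallel fs.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"one tiny file, one inode")
    path = f.name

info = os.stat(path)
print(info.st_ino, info.st_size)   # the inode number and size in bytes
os.unlink(path)
```

Packing the whole dataset into one big Arrow or HDF5 file (next slides) reduces this to a handful of inodes.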

---

### Pyarrow File Creation

```python
    binary_t = pa.binary()
    uint16_t = pa.uint16()
```

---

### Pyarrow File Creation

```python
    binary_t = pa.binary()
    uint16_t = pa.uint16()

    schema = pa.schema([
        pa.field('image_data', binary_t),
        pa.field('label', uint16_t),
    ])
```

---

### Pyarrow File Creation

```python
    with pa.OSFile(
            os.path.join(args.target_folder, f'ImageNet-{split}.arrow'),
            'wb',
    ) as f:
        with pa.ipc.new_file(f, schema) as writer:
```

---

### Pyarrow File Creation

```python
    with open(sample, 'rb') as f:
        img_string = f.read()

    image_data = pa.array([img_string], type=binary_t)
    label = pa.array([label], type=uint16_t)

    batch = pa.record_batch([image_data, label], schema=schema)

    writer.write(batch)
```

---

### Access Arrow File

```python
def __getitem__(self, idx):
    if self.arrowfile is None:
        self.arrowfile = pa.OSFile(self.data_root, 'rb')
        self.reader = pa.ipc.open_file(self.arrowfile)

    row = self.reader.get_batch(idx)

    img_string = row['image_data'][0].as_py()
    target = row['label'][0].as_py()

    with io.BytesIO(img_string) as byte_stream:
        with Image.open(byte_stream) as img:
            img = img.convert("RGB")

    if self.transform:
        img = self.transform(img)

    return img, target
```

---

### HDF5

```python
with h5py.File(os.path.join(args.target_folder, 'ImageNet.h5'), "w") as f:
```

---

### HDF5

```python
group = f.create_group(split)
```

---

### HDF5

```python
dt_sample = h5py.vlen_dtype(np.dtype(np.uint8))
dt_target = np.dtype('int16')

dset = group.create_dataset(
                'images',
                (len(samples),),
                dtype=dt_sample,
            )

dtargets = group.create_dataset(
        'targets',
        (len(samples),),
        dtype=dt_target,
    )
```

---

### HDF5

```python
for idx, (sample, target) in tqdm(enumerate(zip(samples, targets))):
    with open(sample, 'rb') as f:
        img_string = f.read()
        dset[idx] = np.array(list(img_string), dtype=np.uint8)
        dtargets[idx] = target
```

---

### Access h5 File

```python
def __getitem__(self, idx):
    if self.h5file is None:
        self.h5file = h5py.File(self.train_data_path, 'r')[self.split]
        self.imgs = self.h5file["images"]
        self.targets = self.h5file["targets"]

    img_string = self.imgs[idx]
    target = self.targets[idx]

    with io.BytesIO(img_string) as byte_stream:
        with Image.open(byte_stream) as img:
            img = img.convert("RGB")

    if self.transform:
        img = self.transform(img)

    return img, target
```

---

### DEMO

---

### Exercise

- Could you create an Arrow file for the Flickr dataset stored in `/p/scratch/training2402/data/Flickr30K/` and read it using a dataloader?