A comprehensive repository for setting up distributed training workflows using PyTorch and Kubeflow, complete with metrics collection for GPUs, CPUs, memory, disk usage, and network throughput.
You may use the provided Dockerfile or build your own image.
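If you build your own image, the standard `docker build -t <registry>/<image>:<tag> .` followed by `docker push` is sufficient; the tag you push is what goes in the `image:` field of the PyTorchJob manifest below (the registry and tag shown here are placeholders).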
This repository provides a robust framework for distributed training with PyTorch using Kubeflow's PyTorchJob. It includes detailed metrics collection scripts for monitoring resource usage across GPUs, CPUs, memory, and disk space during training.
- Distributed Training:
  - Deploy distributed PyTorch training jobs with `PyTorchJob` CRDs on Kubernetes.
  - Leverages RDMA-enabled GPUs for optimized throughput.
- Metrics Collection:
  - Logs GPU utilization, memory, and power.
  - Tracks CPU and memory usage.
  - Monitors disk and network throughput.
- Flexible Configurations:
  - Adjustable logging intervals.
  - Supports fine-tuning and evaluation on large-scale datasets.
- Efficient Resource Management:
  - Uses gradient checkpointing and mixed precision to reduce VRAM usage.
Prerequisites:

- Kubernetes cluster with Kubeflow installed
- NVIDIA GPUs with RDMA support and NCCL configured
- Python 3.8+
- PyTorch 2.3.1+cu121
- Persistent storage configured for training datasets and outputs
- Clone this repository:

```bash
git clone https://github.com/bbenshab/KubeCon-RoCE.git
cd KubeCon-RoCE
```
- Install the Kubeflow PyTorchJob CRD:

```bash
kubectl apply -f https://raw.githubusercontent.com/kubeflow/pytorch-operator/master/manifests/pytorch-operator.yaml
```
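To confirm the CRD is registered, `kubectl get crd pytorchjobs.kubeflow.org` should list the `PyTorchJob` resource.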
- Configure your Kubernetes environment for multi-node, RDMA-enabled training (one possible pod-level configuration is sketched below).
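The exact setup depends on your cluster's networking stack, so the following is only a sketch. It assumes RDMA-capable secondary interfaces exposed through a Multus network attachment and an RDMA device plugin; the network names and the `rdma/roce_gdr` resource are placeholders for whatever your cluster actually provides.

```yaml
# Sketch: pod-template fragment for an RDMA-enabled replica.
# Network attachment and RDMA resource names are cluster-specific placeholders.
template:
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/networks: net1-0,net1-1   # Multus secondary networks
  spec:
    containers:
      - name: pytorch
        securityContext:
          capabilities:
            add: ["IPC_LOCK"]        # commonly required for RDMA memory registration
        resources:
          limits:
            nvidia.com/gpu: 8        # GPUs per replica
            rdma/roce_gdr: 1         # RDMA device-plugin resource (placeholder)
```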
- Define your `PyTorchJob` YAML configuration for training. Below is an example configuration:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-training
  namespace: default
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: pytorch
              image: image-name
              env:
                - name: NCCL_DEBUG
                  value: INFO
                - name: TRAINING_SCRIPT
                  value: /mnt/storage/fms-hf-tuning/tuning/sft_trainer.py
              volumeMounts:
                - name: storage-volume
                  mountPath: /mnt/storage
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: pytorch
              image: image-name
              volumeMounts:
                - name: storage-volume
                  mountPath: /mnt/storage
```
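Note that the example mounts a `storage-volume` that still has to be declared in each replica's pod spec. A minimal sketch, assuming the shared storage is backed by a pre-created PersistentVolumeClaim (the claim name below is a placeholder):

```yaml
# Add alongside `containers:` in both the Master and Worker pod specs.
volumes:
  - name: storage-volume
    persistentVolumeClaim:
      claimName: training-storage-pvc   # placeholder; use your actual claim
```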
- Submit the PyTorchJob to your Kubernetes cluster:

```bash
kubectl apply -f pytorchjob.yaml
```
- Monitor the job status:

```bash
kubectl get pytorchjobs
```
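For per-replica conditions and events, `kubectl describe pytorchjob pytorch-training` gives more detail; the name matches `metadata.name` in the manifest above.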
`collect_metrics.sh` runs during training and prints the collected statistics to the pod logs when training completes.
Monitored metrics include:
- GPU utilization, memory, and power usage
- CPU & memory utilization
- Disk and network throughput
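Once the job finishes, these statistics can be read from the master pod's log, for example with `kubectl logs <master-pod-name>` (find the pod name via `kubectl get pods`).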
This repository leverages multiple configurable environment variables to fine-tune the training setup. Below are the key variables extracted from the YAML configuration:
- `HF_HOME`: Temporary directory for the Hugging Face cache (default: `/tmp/output`).
- `OUTPUT_DIR`: Directory where training outputs, checkpoints, and logs are saved.
- `SHARED_PATH`: Path to a shared directory across distributed nodes (default: `/mnt/shared`).
- `TRAINING_SCRIPT`: Path to the training script (e.g., `/mnt/storage/fms-hf-tuning/tuning/sft_trainer.py`).
- `MODEL_NAME_OR_PATH`: Path to the pre-trained model or model checkpoint for fine-tuning.
- `TRAINING_DATA_PATH`: Path to the dataset file for training (e.g., `/mnt/storage/openwebtext.jsonl`).

- `NNODES`: Number of nodes participating in the training job.
- `NPROC_PER_NODE`: Number of processes per node (typically matches the number of GPUs per node).
- `MASTER_PORT`: Port used for communication between nodes.
- `BACKEND`: Backend used for distributed training (e.g., `nccl`).

- `NUM_TRAIN_EPOCHS`: Number of epochs for training.
- `LEARNING_RATE`: Learning rate for the optimizer (e.g., `2e-5`).
- `LR_SCHEDULER_TYPE`: Type of learning rate scheduler (e.g., `cosine`).
- `MAX_SEQ_LENGTH`: Maximum sequence length for tokenized input.
- `PER_DEVICE_TRAIN_BATCH_SIZE`: Batch size per GPU.
- `SAVE_STEPS`: Frequency (in steps) at which checkpoints are saved and metrics logged.

- `FSDP`: Configuration for Fully Sharded Data Parallelism (e.g., `full_shard`).
- `FSDP_CONFIG`: Path to the FSDP configuration JSON file.
- `TORCH_DTYPE`: Data type for PyTorch tensors (e.g., `bfloat16` for mixed-precision training).
- `PEFT_METHOD`: Fine-tuning method (e.g., `lora`).
- `PEFT_METHOD_ENABLED`: Whether to enable the PEFT method (`true` or `false`).
- `TARGET_MODULES`: Target modules for LoRA fine-tuning (e.g., `q_proj v_proj k_proj o_proj`).
- `LORA_R`: LoRA rank (e.g., `8`).
- `LORA_ALPHA`: LoRA scaling factor (e.g., `8`).
- `LORA_DROPOUT`: LoRA dropout rate (e.g., `0.1`).

- `LOGGING_STRATEGY`: Logging strategy (`steps` or `epoch`).
- `LOGGING_STEPS`: Frequency (in steps) for logging metrics.
- `NUM_WORKERS`: Number of workers for data loading.

- `ENABLE_RDMA`: Whether to enable RDMA (`true` or `false`).
- `NCCL_SOCKET_IFNAME`: Network interfaces for NCCL communication (e.g., `net1-0,net1-1`).
- `NCCL_IB_HCA`: RDMA interfaces (e.g., `mlx5_3:1,mlx5_0:1`).
- `NCCL_IB_GID_INDEX`: GID index for RDMA (e.g., `3`).
These environment variables can be adjusted in the YAML file under the `env` section to customize the training configuration.
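For example, a minimal sketch of overriding a few of them on the training container (the values are illustrative, not recommendations):

```yaml
env:
  - name: NNODES
    value: "2"              # number of nodes in the job
  - name: NPROC_PER_NODE
    value: "8"              # one process per GPU on each node
  - name: NUM_TRAIN_EPOCHS
    value: "1"
  - name: LEARNING_RATE
    value: "2e-5"
  - name: TORCH_DTYPE
    value: "bfloat16"       # mixed-precision training
  - name: PEFT_METHOD
    value: "lora"
  - name: NCCL_SOCKET_IFNAME
    value: "net1-0,net1-1"  # NCCL network interfaces; cluster-specific
```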
Output files include:
- Hardware metrics (`*-hwstat.csv`): Tracks GPU, CPU, and memory usage over time.
- Metrics logs (`*-metrics.log`): Includes disk usage and storage information.
- Training logs: Captures training progress, loss, and evaluation metrics.