From 34fbab88a343ec8d349f543b96c5d3d2daf39023 Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Wed, 6 Nov 2024 18:51:59 -0600 Subject: [PATCH] Update Training-models-in-SageMaker-notebooks.md --- .../Training-models-in-SageMaker-notebooks.md | 335 +----------------- 1 file changed, 9 insertions(+), 326 deletions(-) diff --git a/episodes/Training-models-in-SageMaker-notebooks.md b/episodes/Training-models-in-SageMaker-notebooks.md index c0742ca..225e653 100644 --- a/episodes/Training-models-in-SageMaker-notebooks.md +++ b/episodes/Training-models-in-SageMaker-notebooks.md @@ -773,335 +773,18 @@ print(f"Runtime for training on SageMaker: {end2 - start2:.2f} seconds, instance * Distributed algorithms: XGBoost has a built-in distributed training capability, but models that perform gradient descent, like deep neural networks, gain more obvious benefits because each instance can compute gradients for a batch of data simultaneously, allowing faster convergence. -## Training a neural network with SageMaker -Let's see how to do a similar experiment, but this time using PyTorch neural networks. We will again demonstrate how to test our custom model train script (train_nn.py) before deploying to SageMaker, and discuss some strategies (e.g., using a GPU) for improving train time when needed. - -### Preparing the data (compressed npz files) -When deploying a PyTorch model on SageMaker, it's helpful to prepare the input data in a format that's directly accessible and compatible with PyTorch's data handling methods. The next code cell will prep our npz files from the existing csv versions. - -:::::::::::::::::::::::::::::::: callout -#### Why are we using this file format? - -1. **Optimized data loading**: - The `.npz` format stores arrays in a compressed, binary format, making it efficient for both storage and loading. PyTorch can easily handle `.npz` files, especially in batch processing, without requiring complex data transformations during training. - -2. **Batch compatibility**: - When training neural networks in PyTorch, it's common to load data in batches. By storing data in an `.npz` file, we can quickly load the entire dataset or specific parts (e.g., `X_train`, `y_train`) into memory and feed it to the PyTorch `DataLoader`, enabling efficient batched data loading. - -3. **Reduced I/O overhead in SageMaker**: - Storing data in `.npz` files minimizes the I/O operations during training, reducing time spent on data handling. This is especially beneficial in cloud environments like SageMaker, where efficient data handling directly impacts training costs and performance. - -4. **Consistency and compatibility**: - Using `.npz` files allows us to ensure consistency between training and validation datasets. Each file (`train_data.npz` and `val_data.npz`) stores the arrays in a standardized way that can be easily accessed by keys (`X_train`, `y_train`, `X_val`, `y_val`). This structure is compatible with PyTorch's `Dataset` class, making it straightforward to design custom datasets for training. - -5. **Support for multiple data types**: - `.npz` files support storage of multiple arrays within a single file. This is helpful for organizing features and labels without additional code. Here, the `train_data.npz` file contains both `X_train` and `y_train`, keeping everything related to training data in one place. Similarly, `val_data.npz` organizes validation features and labels, simplifying file management. 
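For example, a minimal custom `Dataset` that serves batches from these `.npz` files might look like the sketch below. This is illustrative only; the class name is made up for this example, and the actual `train_nn.py` used in this episode may load the data differently.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TitanicNpzDataset(Dataset):
    """Minimal Dataset that serves (features, label) pairs from an .npz file."""
    def __init__(self, npz_path, x_key, y_key):
        data = np.load(npz_path)
        self.X = torch.tensor(data[x_key], dtype=torch.float32)
        self.y = torch.tensor(data[y_key], dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Batched loading for training, e.g.:
# train_loader = DataLoader(TitanicNpzDataset("train_data.npz", "X_train", "y_train"),
#                           batch_size=32, shuffle=True)
```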
In summary, saving the data in `.npz` files ensures a smooth workflow from data loading to model training in PyTorch, leveraging SageMaker's infrastructure for a more efficient, structured training process.
:::::::::::::::::::::::::::::::::::::::


```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import numpy as np

# Load and preprocess the Titanic dataset
df = pd.read_csv(train_filename)

# Encode categorical variables and normalize numerical ones
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
df['Embarked'] = df['Embarked'].fillna('S') # Fill missing values in 'Embarked'
df['Embarked'] = LabelEncoder().fit_transform(df['Embarked'])

# Fill missing values for 'Age' and 'Fare' with median
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# Select features and target
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].values
y = df['Survived'].values

# Normalize features (helps avoid exploding/vanishing gradients)
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Save the preprocessed data to our local jupyter environment
np.savez('train_data.npz', X_train=X_train, y_train=y_train)
np.savez('val_data.npz', X_val=X_val, y_val=y_val)

```

Next, we will upload our compressed files to our S3 bucket. Storage is fairly cheap on AWS (around $0.023 per GB per month), but be mindful of uploading too much data. It may be convenient to store a preprocessed version of the data, just don't store too many versions that aren't being actively used.


```python
import boto3

train_file = "train_data.npz" # Local file path in your notebook environment
val_file = "val_data.npz" # Local file path in your notebook environment

# Initialize the S3 client
s3 = boto3.client('s3')

# Upload the training and validation files to S3
s3.upload_file(train_file, bucket_name, f"{train_file}")
s3.upload_file(val_file, bucket_name, f"{val_file}")

print("Files successfully uploaded to S3.")

```

    Files successfully uploaded to S3.


#### Testing our train script on notebook instance
You should always test code thoroughly before scaling up and using more resources. Here, we will test our script using a small number of epochs — just to verify our setup is correct.


```python
import torch

# Measure training time locally
start_time = t.time()
%run test_AWS/scripts/train_nn.py --train train_data.npz --val val_data.npz --epochs 1000 --learning_rate 0.001
print(f"Local training time: {t.time() - start_time:.2f} seconds, instance_type = {local_instance}")

```


### Deploying PyTorch Neural Network via SageMaker
Now that we have tested things locally, we can train with a larger number of epochs, and a larger instance if needed, by invoking the PyTorch estimator (using our notebook as a controller). Our notebook itself is configured to use ml.m5.large; the code below launches the training job on ml.m5.large as well, and you can swap in a larger instance such as `ml.m5.xlarge` simply by changing `instance_type`.

**Should we use a GPU?**: Since this dataset is fairly small, we don't necessarily need a GPU for training. Considering costs, the m5.xlarge is `$0.17/hour`, while the cheapest GPU instance is `$0.75/hour`. A quick way to keep an eye on spending is to estimate job cost from the billable seconds SageMaker reports at the end of each training job, as in the short sketch below.
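This back-of-the-envelope estimate is only a sketch (not an official pricing API): it multiplies the billable seconds reported for a completed training job by an hourly rate. The rates below are the example prices quoted in this episode and will vary by region and over time, so verify current SageMaker pricing before relying on them.

```python
# Estimate training-job cost: billable seconds x hourly instance rate.
# Rates are the example prices quoted in this episode (assumed; verify current pricing).
hourly_rates = {
    "ml.m5.xlarge": 0.17,    # CPU instance rate quoted above
    "ml.g4dn.xlarge": 0.75,  # cheapest GPU instance rate quoted above
}

billable_seconds = 135  # taken from the "Billable seconds" line of a completed job

for instance, rate in hourly_rates.items():
    cost = billable_seconds / 3600 * rate
    print(f"{instance}: ~${cost:.4f} for {billable_seconds} billable seconds")
```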
However, for larger datasets (> 1 GB) and models, we may want to consider a GPU if training time becomes cumbersome (see [Instances for ML](https://docs.google.com/spreadsheets/d/1uPT4ZAYl_onIl7zIjv5oEAdwy4Hdn6eiA9wVfOBbHmY/edit?usp=sharing)). If that doesn't work, we can try distributed computing (setting `instance_count` > 1). More on this in the next section.


```python
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput

epochs = 10000
instance_count = 1
instance_type="ml.m5.large"
output_path = f's3://{bucket_name}/output_nn/' # this folder will auto-generate if it doesn't exist already

# Define the PyTorch estimator and pass hyperparameters as arguments
pytorch_estimator = PyTorch(
    entry_point="test_AWS/scripts/train_nn.py",
    role=role,
    instance_type=instance_type, # with this small dataset, we don't necessarily need a GPU for fast training.
    instance_count=instance_count, # a single instance here; set this > 1 for distributed training
    framework_version="1.9",
    py_version="py38",
    output_path=output_path,
    sagemaker_session=session,
    hyperparameters={
        "train": "/opt/ml/input/data/train/train_data.npz", # SageMaker will mount this path
        "val": "/opt/ml/input/data/val/val_data.npz", # SageMaker will mount this path
        "epochs": epochs,
        "learning_rate": 0.001
    }
)

# Define input paths
train_input = TrainingInput(f"s3://{bucket_name}/train_data.npz", content_type="application/x-npz")
val_input = TrainingInput(f"s3://{bucket_name}/val_data.npz", content_type="application/x-npz")

# Start the training job and time it
start = t.time()
pytorch_estimator.fit({"train": train_input, "val": val_input})
end = t.time()

print(f"Runtime for training on SageMaker: {end - start:.2f} seconds, instance_type: {instance_type}, instance_count: {instance_count}")

```

    2024-11-03 21:27:03 Uploading - Uploading generated training model
    2024-11-03 21:27:03 Completed - Training job completed
    Training seconds: 135
    Billable seconds: 135
    Runtime for training on SageMaker: 197.62 seconds, instance_type: ml.m5.large, instance_count: 1


### Deploying PyTorch Neural Network via SageMaker with a GPU Instance

In this section, we'll implement the same procedure as above, but using a GPU-enabled instance for potentially faster training. While GPU instances are more expensive, they can be cost-effective for larger datasets or more complex models that require significant computational power.

#### Selecting a GPU Instance
For a small dataset like ours, we don't strictly need a GPU, but for larger datasets or more complex models, a GPU can reduce training time. Here, we'll select an `ml.g4dn.xlarge` instance, which provides a single GPU and costs approximately `$0.75/hour` (check [Instances for ML](https://docs.google.com/spreadsheets/d/1uPT4ZAYl_onIl7zIjv5oEAdwy4Hdn6eiA9wVfOBbHmY/edit?usp=sharing) for detailed pricing).

#### Code Modifications for GPU Use
Using a GPU requires minor changes in your training script (`train_nn.py`). Specifically, you'll need to:
1. Check for GPU availability in PyTorch.
2. Move the model and tensors to the GPU device if available.
#### Enabling PyTorch to use GPU in `train_nn.py`

The following code snippet enables GPU support in `train_nn.py`:

```python
import torch

# Set device: use a GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The model and each batch of tensors then need to be moved to this device
# (e.g., model.to(device), X_batch.to(device)) before the forward/backward passes.
```


```python
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
import time as t

epochs = 10000
instance_count = 1
instance_type="ml.g4dn.xlarge"
output_path = f's3://{bucket_name}/output_nn/'

# Define the PyTorch estimator and pass hyperparameters as arguments
pytorch_estimator_gpu = PyTorch(
    entry_point="test_AWS/scripts/train_nn.py",
    role=role,
    instance_type=instance_type,
    instance_count=instance_count,
    framework_version="1.9",
    py_version="py38",
    output_path=output_path,
    sagemaker_session=session,
    hyperparameters={
        "train": "/opt/ml/input/data/train/train_data.npz",
        "val": "/opt/ml/input/data/val/val_data.npz",
        "epochs": epochs,
        "learning_rate": 0.001
    }
)

# Define input paths
train_input = TrainingInput(f"s3://{bucket_name}/train_data.npz", content_type="application/x-npz")
val_input = TrainingInput(f"s3://{bucket_name}/val_data.npz", content_type="application/x-npz")

# Start the training job and time it
start = t.time()
pytorch_estimator_gpu.fit({"train": train_input, "val": val_input})
end = t.time()
print(f"Runtime for training on SageMaker: {end - start:.2f} seconds, instance_type: {instance_type}, instance_count: {instance_count}")

```

    2024-11-03 21:33:56 Uploading - Uploading generated training model
    2024-11-03 21:33:56 Completed - Training job completed
    Training seconds: 350
    Billable seconds: 350
    Runtime for training on SageMaker: 409.68 seconds, instance_type: ml.g4dn.xlarge, instance_count: 1


#### GPUs can be slow for small datasets/models
> This performance discrepancy (the GPU run above was slower than the CPU run) might be due to the following factors:
>
> 1. **Small Dataset/Model Size**: When datasets and models are small, the overhead of transferring data between the CPU and GPU, as well as managing the GPU, can actually slow things down. For very small models and datasets, CPUs are often faster since there's minimal data to process.
>
> 2. **GPU Initialization Overhead**: Every time a training job starts on a GPU, there's a small overhead for initializing CUDA libraries. For short jobs, this setup time can make the GPU appear slower overall.
>
> 3. **Batch Size**: GPUs perform best with larger batch sizes since they can process many data points in parallel. If the batch size is too small, the GPU is underutilized, leading to suboptimal performance. You may want to try increasing the batch size to see if this reduces training time.
>
> 4. **Instance Type**: Some GPU instances, like the `ml.g4dn` series, have less computational power than the larger `p3` series. They're better suited for inference or lightweight tasks rather than intense training, so a more powerful instance (e.g., `ml.p3.2xlarge`) could help for larger tasks.
>
> If training time continues to be critical, sticking with a CPU instance may be the best approach for smaller datasets. For larger, more complex models and datasets, the GPU's advantages should become more apparent.

### Distributed Training for Neural Networks in SageMaker
In the event that you do need distributed computing to achieve reasonable train times (remember to try an upgraded instance first!), simply adjust the instance count to a number between 2 and 5.
Beyond 5 instances, you'll see diminishing returns and may be needlessly spending extra money and compute energy.


```python
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
import time as t

epochs = 10000
instance_count = 2 # increasing to 2 to see if it has any benefit (likely won't see any with this small dataset)
instance_type="ml.m5.xlarge"
output_path = f's3://{bucket_name}/output_nn/'

# Define the PyTorch estimator and pass hyperparameters as arguments
pytorch_estimator = PyTorch(
    entry_point="test_AWS/scripts/train_nn.py",
    role=role,
    instance_type=instance_type, # with this small dataset, we don't necessarily need a GPU for fast training.
    instance_count=instance_count, # Distributed training with two instances
    framework_version="1.9",
    py_version="py38",
    output_path=output_path,
    sagemaker_session=session,
    hyperparameters={
        "train": "/opt/ml/input/data/train/train_data.npz", # SageMaker will mount this path
        "val": "/opt/ml/input/data/val/val_data.npz", # SageMaker will mount this path
        "epochs": epochs,
        "learning_rate": 0.001
    }
)

# Define input paths
train_input = TrainingInput(f"s3://{bucket_name}/train_data.npz", content_type="application/x-npz")
val_input = TrainingInput(f"s3://{bucket_name}/val_data.npz", content_type="application/x-npz")

# Start the training job and time it
start = t.time()
pytorch_estimator.fit({"train": train_input, "val": val_input})
end = t.time()

print(f"Runtime for training on SageMaker: {end - start:.2f} seconds, instance_type: {instance_type}, instance_count: {instance_count}")

```

    2024-11-03 21:36:35 Uploading - Uploading generated training model
    2024-11-03 21:36:47 Completed - Training job completed
    Training seconds: 228
    Billable seconds: 228
    Runtime for training on SageMaker: 198.36 seconds, instance_type: ml.m5.xlarge, instance_count: 2


### Distributed Training for Neural Networks in SageMaker: Understanding Training Strategies and How Epochs Are Managed
Amazon SageMaker provides two main strategies for distributed training: **data parallelism** and **model parallelism**. Which strategy is used depends on the model size and the configuration of your SageMaker training job, as well as the default settings of the specific SageMaker Estimator you are using.

#### 1. **Data Parallelism (Most Common for Mini-batch SGD)**
- **How it Works**: In data parallelism, each instance in the cluster (e.g., multiple `ml.m5.xlarge` instances) maintains a **complete copy of the model**. The **training dataset is split across instances**, and each instance processes a different subset of data simultaneously. This enables multiple instances to complete forward and backward passes on different data batches independently.
- **Epoch Distribution**: Even though each instance processes all the specified epochs, they only work on a portion of the dataset for each epoch. After each batch, instances synchronize their gradient updates across all instances using a method such as *all-reduce*. This ensures that while each instance is working with a unique data batch, the model weights remain consistent across instances (see the short sketch after this list for the intuition).
- **Key Insight**: Because all instances process the specified number of epochs and synchronize weight updates between batches, each instance's training contributes to a cohesive, shared model. The epoch count is not divided across instances; data parallelism gives each instance a fraction of the *data* per epoch, not a fraction of the epochs. Data parallelism is well-suited for models that can fit into a single instance's memory and benefit from increased data throughput.
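To make the all-reduce synchronization step concrete, here is a small, self-contained illustration of gradient averaging. This is not SageMaker code and is not taken from `train_nn.py`; it simply simulates two "instances" that each compute a gradient on their own mini-batch and then apply the same averaged update, which is why their model weights stay identical.

```python
import torch

torch.manual_seed(0)

# One set of model weights, replicated on two simulated "instances"
w = torch.zeros(3)

# Each instance sees a different mini-batch (random data, purely for illustration)
batch_a, targets_a = torch.randn(4, 3), torch.randn(4)
batch_b, targets_b = torch.randn(4, 3), torch.randn(4)

def local_gradient(weights, X, y):
    """Gradient of mean squared error for a linear model y_hat = X @ weights."""
    weights = weights.clone().requires_grad_(True)
    loss = ((X @ weights - y) ** 2).mean()
    loss.backward()
    return weights.grad

# 1. Each instance computes a gradient on its own batch...
grad_a = local_gradient(w, batch_a, targets_a)
grad_b = local_gradient(w, batch_b, targets_b)

# 2. ...the gradients are averaged across instances (the "all-reduce" step)...
avg_grad = (grad_a + grad_b) / 2

# 3. ...and every instance applies the same averaged update, so weights stay in sync.
w = w - 0.1 * avg_grad
print("Updated weights (identical on every instance):", w)
```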
#### 2. **Model Parallelism (Best for Large Models)**
- **How it Works**: Model parallelism divides the model itself across multiple instances, not the data. This approach is best suited for very large models that cannot fit into a single GPU or instance's memory (e.g., large language models).
- **Epoch Distribution**: The model is partitioned so that each instance is responsible for specific layers or components. Data flows sequentially through these partitions, where each instance processes a part of each batch and passes it to the next instance.
- **Key Insight**: This approach is more complex due to the dependency between model components, so **synchronization occurs across the model layers rather than across data batches**. Model parallelism generally suits scenarios with exceptionally large model architectures that exceed memory limits of typical instances.

### Determining Which Distributed Training Strategy is Used
SageMaker will select the distributed strategy based on:

- **Framework and Estimator Configuration**: Most deep learning frameworks in SageMaker default to data parallelism, especially when using PyTorch or TensorFlow with standard configurations.
- **Model and Data Size**: If you specify a model that exceeds a single instance's memory capacity, SageMaker may switch to model parallelism if configured for it.
- **Instance Count**: When you specify `instance_count > 1` in your Estimator with a deep learning model, SageMaker will use data parallelism by default unless explicitly configured for model parallelism. You can also opt in to a specific strategy through the estimator's `distribution` argument (a short sketch appears at the end of this section).

In the example above (`instance_count=2` and 10,000 epochs), each instance ran all of the epochs, which aligns with data parallelism: each instance processed the full set of epochs independently, but the data batches differed, and the gradient updates were synchronized across instances.


### Summary of Key Points
- **Data Parallelism** is the default distributed training strategy and splits the dataset across instances, allowing each instance to work on different data batches.
  - Each instance runs all specified epochs, but the weight updates are synchronized, so **epoch workload is shared across the data** rather than by reducing epoch count per instance.
- **Model Parallelism** splits the model itself across instances, typically only needed for very large models that exceed the memory capacity of single instances.
- **Choosing Between Distributed Strategies**: Data parallelism is suitable for most neural network models, especially those that fit in memory, while model parallelism is intended for exceptionally large models with memory constraints.

For cost optimization:

- **Single-instance training** is typically more cost-effective for small or moderately sized datasets, while **multi-instance setups** can reduce wall-clock time for larger datasets and complex models, at a higher instance cost.
- For **initial testing**, start with data parallelism on a single instance, and increase instance count if training time becomes prohibitive, while being mindful of communication overhead and scaling efficiency.
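As a closing illustration of opting in to a strategy explicitly rather than relying on the defaults described above: the PyTorch estimator accepts a `distribution` argument. The snippet below is a sketch only; the configuration keys, instance type, and framework/Python versions shown are assumptions that you should verify against the SageMaker distributed-training documentation for your SDK version.

```python
from sagemaker.pytorch import PyTorch

# Sketch: explicitly enabling SageMaker's data-parallel library on a training job.
# The instance type and framework/python versions below are illustrative assumptions;
# the data-parallel library only supports certain multi-GPU instance types.
pytorch_estimator_ddp = PyTorch(
    entry_point="test_AWS/scripts/train_nn.py",
    role=role,                        # defined earlier in the episode
    instance_type="ml.p3.16xlarge",   # example multi-GPU instance (assumption; check docs)
    instance_count=2,
    framework_version="1.13",         # example version with data-parallel support (assumption)
    py_version="py39",
    output_path=output_path,
    sagemaker_session=session,
    hyperparameters={
        "train": "/opt/ml/input/data/train/train_data.npz",
        "val": "/opt/ml/input/data/val/val_data.npz",
        "epochs": epochs,
        "learning_rate": 0.001
    },
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
```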
+### For cost optimization +* Single-instance training is typically more cost-effective for small or moderately sized datasets, while **multi-instance setups** can reduce wall-clock time for larger datasets and complex models, at a higher instance cost. +* For **initial testing**, start with data parallelism on a single instance, and increase instance count if training time becomes prohibitive, while being mindful of communication overhead and scaling efficiency. ::::::::::::::::::::::::::::::::::::: keypoints -- **Environment Initialization**: Setting up a SageMaker session, defining roles, and specifying the S3 bucket are essential for managing data and running jobs in SageMaker. -- **Local vs. Managed Training**: Local training in SageMaker notebooks can be useful for quick tests but lacks the scalability and resource management available in SageMaker-managed training. -- **Estimator Classes**: SageMaker provides framework-specific Estimator classes (e.g., XGBoost, PyTorch, SKLearn) to streamline training setups, each suited to different model types and workflows. -- **Custom Scripts vs. Built-in Images**: Custom training scripts offer flexibility with preprocessing and custom logic, while built-in images are optimized for rapid deployment and simpler setups. -- **Training Data Channels**: Using `TrainingInput` ensures SageMaker manages data efficiently, especially for distributed setups where data needs to be synchronized across multiple instances. -- **Distributed Training Options**: Data parallelism (splitting data across instances) is common for many models, while model parallelism (splitting the model across instances) is useful for very large models that exceed instance memory. +- **Environment initialization**: Setting up a SageMaker session, defining roles, and specifying the S3 bucket are essential for managing data and running jobs in SageMaker. +- **Local vs. managed training**: Always test your code locally (on a smaller scale) before scaling things up. This avoids wasting resources on buggy code that doesn't produce reliable results. +- **Estimator classes**: SageMaker provides framework-specific Estimator classes (e.g., XGBoost, PyTorch, SKLearn) to streamline training setups, each suited to different model types and workflows. +- **Custom scripts vs. built-in images**: Custom training scripts offer flexibility with preprocessing and custom logic, while built-in images are optimized for rapid deployment and simpler setups. +- **Training data channels**: Using `TrainingInput` ensures SageMaker manages data efficiently, especially for distributed setups where data needs to be synchronized across multiple instances. +- **Distributed training options**: Data parallelism (splitting data across instances) is common for many models, while model parallelism (splitting the model across instances) is useful for very large models that exceed instance memory. ::::::::::::::::::::::::::::::::::::::::::::::::