Adding Nsight system as a profiling tool that can be installed at runtime
Lokiiiiii committed Jan 17, 2025
1 parent 93eeb27 commit a530166
Showing 6 changed files with 241 additions and 11 deletions.
61 changes: 61 additions & 0 deletions TROUBLESHOOTING.md
@@ -0,0 +1,61 @@
# **Troubleshooting Guide**

This guide provides steps and information to troubleshoot issues related to the model server and debugging tools. It is a work in progress and will eventually be moved to `serving/docs` upon finalization.

---

## **Profiling**

> Note: profiling support is still under development, and its interfaces may change until they are finalized. In its current state it is recommended only for personal debugging.

The container can be started in **DEBUG mode** by setting the environment variable `DEBUG_MODE=1`. The entrypoint only checks that `DEBUG_MODE` is non-empty, so any non-empty value (including `0`) enables it. When enabled, this mode provides profiling and debugging capabilities with the following effects:

### **1. Installation of Debugging Tools**

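In DEBUG mode, the following tool is installed automatically when the container starts:

- **[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)**
  - Nsight Systems enables system-wide performance analysis.
  - The version of Nsight can be controlled using the environment variable:
    - `NSIGHT_VERSION`: Specifies the version of Nsight Systems to install (e.g., `2024.6.1`). Only digits, dots, and hyphens are accepted. The value is inserted verbatim into the installer filename `NsightSystems-linux-public-<version>.run`, so it must match a published installer. If unset, the install script looks up the latest version listed on the download site.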
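
As a rough sketch of how `NSIGHT_VERSION` is consumed, the snippet below mirrors the validation and URL construction performed by `install_nsys.sh` in this commit. The helper name `nsight_installer_url` is hypothetical and exists only for illustration; the base URL and filename pattern are taken from the script. It only prints the URL and downloads nothing.

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the validation and URL construction in install_nsys.sh.
# nsight_installer_url is a hypothetical helper name, not part of the commit.
BASE_URL="https://developer.download.nvidia.com/devtools/nsight-systems/"

nsight_installer_url() {
    local version="$1"
    # Accept only digits, dots, and hyphens, as the install script does.
    if [[ ! "$version" =~ ^[0-9.-]+$ ]]; then
        echo "invalid NSIGHT_VERSION: ${version}" >&2
        return 1
    fi
    echo "${BASE_URL}NsightSystems-linux-public-${version}.run"
}

# Print the URL that would be requested for the documented example value.
nsight_installer_url "2024.6.1"
```

Whether a given URL actually resolves depends on the installers NVIDIA publishes, which this sketch does not check.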

### **2. Profiling with Nsight Systems**

The model server will automatically start under the `nsys` profiler when `DEBUG_MODE` is enabled. The following environment variables can be configured to customize the profiling behavior:

- **`NSYS_PROFILE_DELAY`**:
  - Specifies the delay in seconds before profiling begins. Must be a non-negative integer.
  - Use this to exclude startup activities and capture only relevant information.
  - **Default**: `30` seconds.

- **`NSYS_PROFILE_DURATION`**:
  - Specifies the duration in seconds for profiling. Must be a non-negative integer.
  - Avoid setting this to values larger than 600 seconds (10 minutes) to prevent generating large and unwieldy reports.
  - **Default**: `600` seconds.

- **`NSYS_PROFILE_TRACE`**:
  - Allows customization of the APIs and operations to trace, given as a comma-separated list.
  - Examples include `cuda`, `nvtx`, `osrt`, `cudnn`, `cublas`, `mpi`, and `python-gil`.
  - Only lowercase letters, numbers, commas, and hyphens are allowed.
  - **Default**: `cuda,nvtx,osrt,cudnn,cublas,mpi,python-gil`.
  - Refer to the [Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) for more details.
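
The mapping from these variables to `nsys profile` flags can be sketched as follows. This is a simplified illustration, not the script itself: `build_nsys_args` is a hypothetical helper, while the defaults and validation patterns are the ones used in `start_debug_tools.sh`. It only prints the flags and never invokes `nsys`.

```bash
#!/usr/bin/env bash
# Sketch only: shows how the variables above become nsys flags.
# build_nsys_args is a hypothetical helper; start_debug_tools.sh inlines this logic.
build_nsys_args() {
    local delay="${NSYS_PROFILE_DELAY:-30}"
    local duration="${NSYS_PROFILE_DURATION:-600}"
    local trace="${NSYS_PROFILE_TRACE:-cuda,nvtx,osrt,cudnn,cublas,mpi,python-gil}"

    # Reject anything that is not a plain non-negative integer.
    [[ "$delay" =~ ^[0-9]+$ ]] || { echo "invalid NSYS_PROFILE_DELAY: ${delay}" >&2; return 1; }
    [[ "$duration" =~ ^[0-9]+$ ]] || { echo "invalid NSYS_PROFILE_DURATION: ${duration}" >&2; return 1; }
    # Trace list: lowercase letters, digits, commas, and hyphens only.
    [[ "$trace" =~ ^[a-z0-9,-]+$ ]] || { echo "invalid NSYS_PROFILE_TRACE: ${trace}" >&2; return 1; }

    echo "--delay ${delay} --duration ${duration} --trace ${trace}"
}

# Print the flags for a 20 s delay and a 5 min capture; nothing is executed.
NSYS_PROFILE_DELAY=20 NSYS_PROFILE_DURATION=300 build_nsys_args
```

Validating before interpolating keeps arbitrary shell text out of the profiler command line, which appears to be the intent of the "Security Validation" steps in the script.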

### **3. Report Generation and Upload**

- After profiling is complete, the generated `.nsys-rep` report is automatically uploaded (using `s5cmd`) to the specified S3 location if the `S3_DEBUG_PATH` environment variable is provided. If it is not set, no upload takes place.
  - **`S3_DEBUG_PATH`**:
    - Specifies the S3 bucket and path for storing the profiling report.
    - Must start with `s3://` and end with a trailing `/`.
    - **Example**: `s3://my-bucket/profiles/`.
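
To check a value before starting the container, a format test along these lines may help. It is a sketch in the spirit of the check in `start_debug_tools.sh`, not a copy of it: the helper name `is_valid_s3_debug_path` is hypothetical, and the pattern assumes that nested key prefixes should be accepted.

```bash
#!/usr/bin/env bash
# Sketch only: a format check for S3_DEBUG_PATH.
# is_valid_s3_debug_path is a hypothetical helper name, and the pattern is an
# assumption about which paths should be accepted.
is_valid_s3_debug_path() {
    # Bucket: lowercase letters, digits, dots, hyphens.
    # Key: zero or more segments, each followed by "/", so the path ends in "/".
    local re='^s3://[a-z0-9.-]+/([A-Za-z0-9._-]+/)*$'
    [[ "$1" =~ $re ]]
}

for path in "s3://my-bucket/profiles/" "s3://my-bucket/profiles" "http://my-bucket/profiles/"; do
    if is_valid_s3_debug_path "$path"; then
        echo "ok:       ${path}"
    else
        echo "rejected: ${path}"
    fi
done
```

Placing the hyphen last inside each bracket expression keeps it a literal character rather than part of a range.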

---

### **Example Usage**

To enable profiling and customize its behavior:

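```bash
# Pass each variable into the container with -e; a VAR=value prefix on the
# command line would only set it for the docker client process.
docker run \
  -e DEBUG_MODE=1 \
  -e NSIGHT_VERSION=2024.6.1 \
  -e NSYS_PROFILE_DELAY=20 \
  -e NSYS_PROFILE_DURATION=300 \
  -e NSYS_PROFILE_TRACE="cuda,nvtx,osrt" \
  -e S3_DEBUG_PATH="s3://my-bucket/debug-reports/" \
  my-container
```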
17 changes: 12 additions & 5 deletions serving/docker/dockerd-entrypoint-with-cuda-compat.sh
@@ -60,11 +60,18 @@ fi

 if [[ "$1" = "serve" ]]; then
     shift 1
-    code=77
-    while [[ code -eq 77 ]]; do
-        /usr/bin/djl-serving "$@"
-        code=$?
-    done
+    echo "DEBUG_MODE=$DEBUG_MODE"
+    if [[ -n "$DEBUG_MODE" ]]; then
+        set -e
+        source /opt/djl/scripts/install_debug_tools.sh
+        /opt/djl/scripts/start_debug_tools.sh "$@"
+    else
+        code=77
+        while [[ code -eq 77 ]]; do
+            /usr/bin/djl-serving "$@"
+            code=$?
+        done
+    fi
 elif [[ "$1" = "partition" ]] || [[ "$1" = "train" ]]; then
     shift 1
     /usr/bin/python3 /opt/djl/partition/partition.py "$@"
19 changes: 13 additions & 6 deletions serving/docker/dockerd-entrypoint.sh
@@ -9,12 +9,19 @@ fi

 if [[ "$1" = "serve" ]]; then
     shift 1
-    code=77
-    while [[ code -eq 77 ]]; do
-        /usr/bin/djl-serving "$@"
-        code=$?
-    done
-    exit $code
+    echo "DEBUG_MODE=$DEBUG_MODE"
+    if [[ -n "$DEBUG_MODE" ]]; then
+        set -e
+        source /opt/djl/scripts/install_debug_tools.sh
+        /opt/djl/scripts/start_debug_tools.sh "$@"
+    else
+        code=77
+        while [[ code -eq 77 ]]; do
+            /usr/bin/djl-serving "$@"
+            code=$?
+        done
+        exit $code
+    fi
 elif [[ "$1" = "partition" ]] || [[ "$1" = "train" ]]; then
     set -e
     shift 1
4 changes: 4 additions & 0 deletions serving/docker/scripts/install_debug_tools.sh
@@ -0,0 +1,4 @@
#!/usr/bin/env bash
set -e

source /opt/djl/scripts/install_nsys.sh
90 changes: 90 additions & 0 deletions serving/docker/scripts/install_nsys.sh
@@ -0,0 +1,90 @@
#!/usr/bin/env bash

# Define the base URL for Nsight Systems
BASE_URL="https://developer.download.nvidia.com/devtools/nsight-systems/"

# Check for NSIGHT_VERSION
if [ -n "${NSIGHT_VERSION}" ]; then
    # Use the caller-supplied version; it is validated further down
    echo "NSIGHT_VERSION is set: ${NSIGHT_VERSION}"
else
    # Find the latest version dynamically
    # Note: this lookup relies on wget already being present; the script only installs it later
    echo "Fetching the latest Nsight Systems version..."
    NSIGHT_VERSION=$(wget -qO- "$BASE_URL" | grep -oP 'NsightSystems-linux-public-\K[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+-[0-9]+' | sort -V | tail -1)

    if [ -z "$NSIGHT_VERSION" ]; then
        echo "Failed to fetch the latest version. Exiting."
        exit 1
    fi

    echo "Latest Nsight Systems version found: $NSIGHT_VERSION"
fi

# Security validation: allow only digits, dots, and hyphens
if [[ "${NSIGHT_VERSION}" =~ ^[0-9.-]+$ ]]; then
    echo "NSIGHT_VERSION is valid: ${NSIGHT_VERSION}"
else
    echo "NSIGHT_VERSION is invalid: ${NSIGHT_VERSION}"
    exit 1
fi

# Construct the download URL
DOWNLOAD_URL="${BASE_URL}NsightSystems-linux-public-${NSIGHT_VERSION}.run"

# Define the installation directory (default is /opt/nvidia/nsight-systems)
INSTALL_DIR="/opt/nvidia/nsight-systems"

# Update and install prerequisites
echo "Updating system and installing prerequisites..."
apt-get update
apt-get install -y wget build-essential aria2 expect

# Download Nsight Systems installer
echo "Downloading Nsight Systems ${NSIGHT_VERSION}..."
aria2c -x 16 "$DOWNLOAD_URL" -o nsight-systems-installer.run

# Verify the download
if [ ! -f "nsight-systems-installer.run" ]; then
    echo "Download failed. Exiting."
    exit 1
fi

# Make the installer executable
echo "Making the installer executable..."
chmod +x nsight-systems-installer.run

# Run the installer
echo "Running the Nsight Systems installer..."
# The installer is not respecting the CLI arguments
expect <<EOF
spawn ./nsight-systems-installer.run --quiet --accept --target ${INSTALL_DIR}
# Send ENTER and ACCEPT without waiting for specific prompts
send "\r"
sleep 1
send "ACCEPT\r"
sleep 1
send "${INSTALL_DIR}\r"
expect eof
EOF

# Add Nsight Systems to PATH
echo "Adding Nsight Systems to PATH..."
export PATH="${INSTALL_DIR}/pkg/bin:${PATH}"
echo "export PATH=${INSTALL_DIR}/pkg/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc

# Verify installation
echo "Verifying Nsight Systems installation..."
if command -v nsys &>/dev/null; then
    echo "Nsight Systems installed successfully!"
    nsys --version
else
    echo "Nsight Systems installation failed."
    exit 1
fi

# Clean up
echo "Cleaning up installer..."
rm -f nsight-systems-installer.run

echo "Installation complete. You can now use Nsight Systems with the 'nsys' command."
61 changes: 61 additions & 0 deletions serving/docker/scripts/start_debug_tools.sh
@@ -0,0 +1,61 @@
#!/usr/bin/env bash
set -e

# Function to validate numeric variables
validate_numeric_variable() {
    local var_name="$1"
    local var_value="$2"

    if [[ "${var_value}" =~ ^[0-9]+$ ]]; then
        echo "${var_name} is valid: ${var_value}"
    else
        echo "${var_name} is invalid: ${var_value}"
        exit 1
    fi
}

# Delay for start of profile capture to avoid profiling unintended setup steps
NSYS_PROFILE_DELAY=${NSYS_PROFILE_DELAY:-30}
# Security Validation
validate_numeric_variable "NSYS_PROFILE_DELAY" "${NSYS_PROFILE_DELAY}"

# Duration for profile capture to avoid diluting the profile.
NSYS_PROFILE_DURATION=${NSYS_PROFILE_DURATION:-600}
# Security Validation
validate_numeric_variable "NSYS_PROFILE_DURATION" "${NSYS_PROFILE_DURATION}"

# APIs and operations to trace during the profile capture.
NSYS_PROFILE_TRACE=${NSYS_PROFILE_TRACE:-"cuda,nvtx,osrt,cudnn,cublas,mpi,python-gil"}
# Security Validation
if [[ "$NSYS_PROFILE_TRACE" =~ ^[a-z0-9,-]+$ ]]; then
    echo "NSYS_PROFILE_TRACE is valid: ${NSYS_PROFILE_TRACE}"
else
    echo "NSYS_PROFILE_TRACE is invalid: ${NSYS_PROFILE_TRACE}"
    echo "Only lowercase letters, numbers, commas, and hyphens are allowed."
    exit 1
fi

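if [ -n "${S3_DEBUG_PATH}" ]; then
    # Validate the S3 path format.
    # The hyphen is placed last in each bracket expression so that it is a literal
    # character instead of forming a range that would reject keys such as "debug-reports".
    if [[ ! "$S3_DEBUG_PATH" =~ ^s3://[a-z0-9.-]+(/[a-zA-Z0-9._-]+)*/$ ]]; then
        echo "Error: S3_DEBUG_PATH must be of the format s3://bucket/key/"
        exit 1
    fi
fi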

# Nsys exits with a non-zero code when the application is terminated due to a
# timeout, which is expected, hence the "|| true".
nsys profile \
    --kill=sigkill \
    --wait=primary \
    --show-output true \
    --osrt-threshold 10000 \
    --delay "${NSYS_PROFILE_DELAY}" \
    --duration "${NSYS_PROFILE_DURATION}" \
    --python-backtrace=cuda \
    --trace "${NSYS_PROFILE_TRACE}" \
    --cudabacktrace all:10000 \
    --output "$(hostname).nsys-rep" \
    -- djl-serving "$@" || true

if [ -n "${S3_DEBUG_PATH}" ]; then
    s5cmd cp /opt/djl/*.nsys-rep "$S3_DEBUG_PATH"
fi
