Adding Nsight system as a profiling tool that can be installed at runtime
Showing 6 changed files with 241 additions and 11 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
# **Troubleshooting Guide** | ||
|
||
This guide provides steps and information to troubleshoot issues related to the model server and debugging tools. It is a work in progress and will eventually be moved to `serving/docs` upon finalization. | ||
|
||
--- | ||
|
||
## **Profiling** | ||
|
||
> **Note:** Profiling is still under development, and its interfaces may change until they are finalized. In its current state, it is recommended only for personal debugging. | ||
|
||
The container can be started in **DEBUG mode** by setting the environment variable `DEBUG_MODE=1`. Enabling this mode has the following effects: | ||
|
||
### **1. Installation of Debugging Tools** | ||
|
||
In DEBUG mode, the following tool is installed automatically at runtime: | ||
|
||
- **[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)** | ||
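- Nsight Systems enables system-wide performance analysis. | ||
- The version to install can be controlled using the environment variable: | ||
- `NSIGHT_VERSION`: Specifies the version of Nsight Systems to install. If it is unset, the install script looks up the latest version published on the NVIDIA download site. The value may contain only digits, dots, and hyphens, and it is inserted directly into the installer filename (`NsightSystems-linux-public-<version>.run`), so it must match a published installer exactly. The script's own lookup matches versions of the form `<major>.<minor>.<patch>.<build>-<id>`, so an abbreviated value such as `2024.6.1` may not resolve to a downloadable file. | ||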
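
The version string ends up inside the installer URL. As a minimal sketch of that mapping, assuming the same base URL and filename pattern as the install script in this commit, the hypothetical helper `nsight_download_url` below validates a version and prints the URL that would be fetched. The helper name and the version in the example are made up for illustration; the version is not a real release.

```bash
#!/usr/bin/env bash
# Base URL and filename pattern as used by the install script in this commit.
BASE_URL="https://developer.download.nvidia.com/devtools/nsight-systems/"

# Hypothetical helper (not part of the commit): print the installer URL for a
# version, or fail if the version contains anything other than digits, dots,
# and hyphens.
nsight_download_url() {
  local version="$1"
  if [[ ! "${version}" =~ ^[0-9.-]+$ ]]; then
    echo "invalid NSIGHT_VERSION: ${version}" >&2
    return 1
  fi
  echo "${BASE_URL}NsightSystems-linux-public-${version}.run"
}

# Made-up version string, for illustration only.
nsight_download_url "2024.6.1.0-1234567"
```

Printing the URL before downloading makes it easy to check by hand whether a given `NSIGHT_VERSION` corresponds to a file that actually exists.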
|
||
### **2. Profiling with Nsight Systems** | ||
|
||
The model server will automatically start under the `nsys` profiler when `DEBUG_MODE` is enabled. The following environment variables can be configured to customize the profiling behavior: | ||
|
||
- **`NSYS_PROFILE_DELAY`**: | ||
- Specifies the delay in seconds before profiling begins. | ||
- Use this to exclude startup activities and capture only relevant information. | ||
- **Default**: `30` seconds. | ||
|
||
- **`NSYS_PROFILE_DURATION`**: | ||
- Specifies the duration in seconds for profiling. | ||
- Avoid setting this to values larger than 600 seconds (10 minutes) to prevent generating large and unwieldy reports. | ||
- **Default**: `600` seconds. | ||
|
||
- **`NSYS_PROFILE_TRACE`**: | ||
- Specifies a comma-separated list of the APIs and operations to trace. | ||
- Supported values include `cuda`, `nvtx`, `osrt`, `cudnn`, `cublas`, `mpi`, and `python-gil`. Only lowercase letters, digits, commas, and hyphens are accepted. | ||
- **Default**: `cuda,nvtx,osrt,cudnn,cublas,mpi,python-gil`. | ||
- Refer to the [Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) for more details. | ||
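
To see how the three variables combine with their defaults, here is a minimal dry-run sketch. It assumes the default values used by the startup script in this commit and only prints a shortened command line; `print_nsys_command` is a hypothetical helper, and the real script passes additional `nsys` options that are omitted here.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): print the profiling options
# that result from the current environment, falling back to the defaults.
print_nsys_command() {
  local delay="${NSYS_PROFILE_DELAY:-30}"
  local duration="${NSYS_PROFILE_DURATION:-600}"
  local trace="${NSYS_PROFILE_TRACE:-cuda,nvtx,osrt,cudnn,cublas,mpi,python-gil}"
  echo "nsys profile --delay ${delay} --duration ${duration} --trace ${trace}"
}

# With whatever is set in the current environment (defaults otherwise).
print_nsys_command

# With overrides applied to this one call only.
NSYS_PROFILE_DELAY=20 NSYS_PROFILE_DURATION=300 NSYS_PROFILE_TRACE="cuda,nvtx" print_nsys_command
```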
|
||
### **3. Report Generation and Upload** | ||
|
||
- After profiling completes, the generated `.nsys-rep` report is uploaded automatically to S3 if the `S3_DEBUG_PATH` environment variable is set. If it is not set, no upload is attempted. | ||
- **`S3_DEBUG_PATH`**: | ||
- Specifies the S3 bucket and key prefix under which the profiling report is stored. | ||
- Must have the form `s3://bucket/key/`, including the trailing slash; any other value stops the startup script with an error. | ||
- **Example**: `s3://my-bucket/profiles/`. | ||
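
Because a malformed path stops the server from starting, it can help to check a value before launching the container. The following is a minimal sketch of such a check. The helper `is_valid_s3_debug_path` and its pattern are assumptions written for illustration, a simplified stand-in rather than the exact expression used by the startup script.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): succeed only for paths of the
# form s3://bucket/ or s3://bucket/key/.../ that end with a slash.
is_valid_s3_debug_path() {
  local path="$1"
  # The hyphen sits last in each bracket expression so it is taken literally.
  local pattern='^s3://[a-z0-9.-]+(/[a-zA-Z0-9._-]+)*/$'
  [[ "${path}" =~ ${pattern} ]]
}

for candidate in "s3://my-bucket/profiles/" "s3://my-bucket/profiles" "my-bucket/profiles/"; do
  if is_valid_s3_debug_path "${candidate}"; then
    echo "accepted: ${candidate}"
  else
    echo "rejected: ${candidate}"
  fi
done
```

Keeping the pattern in a variable and expanding it unquoted on the right-hand side of `=~` avoids quoting surprises inside `[[ ... ]]`.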
|
||
--- | ||
|
||
### **Example Usage** | ||
|
||
To enable profiling and customize its behavior: | ||
|
||
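```bash | ||
# Pass each variable into the container with -e; variables set only in the | ||
# calling shell are not visible inside the container. | ||
docker run \ | ||
-e DEBUG_MODE=1 \ | ||
-e NSIGHT_VERSION=2024.6.1 \ | ||
-e NSYS_PROFILE_DELAY=20 \ | ||
-e NSYS_PROFILE_DURATION=300 \ | ||
-e NSYS_PROFILE_TRACE="cuda,nvtx,osrt" \ | ||
-e S3_DEBUG_PATH="s3://my-bucket/debug-reports/" \ | ||
my-container | ||
``` |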
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
#!/usr/bin/env bash | ||
set -e | ||
|
||
source /opt/djl/scripts/install_nsys.sh |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
#!/usr/bin/env bash | ||
|
||
# Define the base URL for Nsight Systems | ||
BASE_URL="https://developer.download.nvidia.com/devtools/nsight-systems/" | ||
|
||
# Check for NSIGHT_VERSION | ||
if [ -n "${NSIGHT_VERSION}" ]; then | ||
# Use the caller-supplied version; its format is validated further down | ||
echo "NSIGHT_VERSION is set: ${NSIGHT_VERSION}" | ||
else | ||
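# Find the latest version dynamically (relies on wget already being present; | ||
# it is only installed by the apt-get step further down) | ||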
echo "Fetching the latest Nsight Systems version..." | ||
NSIGHT_VERSION=$(wget -qO- "$BASE_URL" | grep -oP 'NsightSystems-linux-public-\K[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+-[0-9]+' | sort -V | tail -1) | ||
|
||
if [ -z "$NSIGHT_VERSION" ]; then | ||
echo "Failed to fetch the latest version. Exiting." | ||
exit 1 | ||
fi | ||
|
||
echo "Latest Nsight Systems version found: $NSIGHT_VERSION" | ||
fi | ||
|
||
# Security Validation | ||
if [[ "${NSIGHT_VERSION}" =~ ^[0-9.-]+$ ]]; then | ||
echo "NSIGHT_VERSION is valid: ${NSIGHT_VERSION}" | ||
else | ||
echo "NSIGHT_VERSION is invalid: ${NSIGHT_VERSION}" | ||
exit 1 | ||
fi | ||
|
||
# Construct the download URL | ||
DOWNLOAD_URL="${BASE_URL}NsightSystems-linux-public-${NSIGHT_VERSION}.run" | ||
|
||
# Define the installation directory (default is /opt/nvidia/nsight-systems) | ||
INSTALL_DIR="/opt/nvidia/nsight-systems" | ||
|
||
# Update and install prerequisites | ||
echo "Updating system and installing prerequisites..." | ||
apt-get update | ||
apt-get install -y wget build-essential aria2 expect | ||
|
||
# Download Nsight Systems installer | ||
echo "Downloading Nsight Systems ${NSIGHT_VERSION}..." | ||
aria2c -x 16 "$DOWNLOAD_URL" -o nsight-systems-installer.run | ||
|
||
# Verify the download | ||
if [ ! -f "nsight-systems-installer.run" ]; then | ||
echo "Download failed. Exiting." | ||
exit 1 | ||
fi | ||
|
||
# Make the installer executable | ||
echo "Making the installer executable..." | ||
chmod +x nsight-systems-installer.run | ||
|
||
# Run the installer | ||
echo "Running the Nsight Systems installer..." | ||
# The installer ignores its non-interactive CLI flags, so answer its prompts with expect | ||
expect <<EOF | ||
spawn ./nsight-systems-installer.run --quiet --accept --target ${INSTALL_DIR} | ||
# Send ENTER and ACCEPT without waiting for specific prompts | ||
send "\r" | ||
sleep 1 | ||
send "ACCEPT\r" | ||
sleep 1 | ||
send "${INSTALL_DIR}\r" | ||
expect eof | ||
EOF | ||
|
||
# Add Nsight Systems to PATH | ||
echo "Adding Nsight Systems to PATH..." | ||
export PATH="${INSTALL_DIR}/pkg/bin:${PATH}" | ||
echo "export PATH=${INSTALL_DIR}/pkg/bin:\$PATH" >> ~/.bashrc | ||
source ~/.bashrc | ||
|
||
# Verify installation | ||
echo "Verifying Nsight Systems installation..." | ||
if command -v nsys &>/dev/null; then | ||
echo "Nsight Systems installed successfully!" | ||
nsys --version | ||
else | ||
echo "Nsight Systems installation failed." | ||
exit 1 | ||
fi | ||
|
||
# Clean up | ||
echo "Cleaning up installer..." | ||
rm -f nsight-systems-installer.run | ||
|
||
echo "Installation complete. You can now use Nsight Systems with the 'nsys' command." |
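
When `NSIGHT_VERSION` is unset, the script above picks the newest installer by piping the scraped version strings through `sort -V | tail -1`. The sketch below isolates that selection step to show why version ordering is used. `latest_version` is a hypothetical helper, and the version strings are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): read version strings on stdin,
# one per line, and print the highest one according to version ordering.
latest_version() {
  sort -V | tail -1
}

# Made-up version strings, for illustration only.
versions=$'2024.2.1.10-111\n2024.10.1.5-333\n2024.6.1.90-222'

echo "version sort picks: $(printf '%s\n' "${versions}" | latest_version)"
echo "byte-wise sort picks: $(printf '%s\n' "${versions}" | LC_ALL=C sort | tail -1)"
```

Version ordering compares each numeric component as a number, so `2024.10` ranks above `2024.6`; a byte-wise sort compares character by character and would rank them the other way round.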
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
#!/usr/bin/env bash | ||
set -e | ||
|
||
# Function to validate numeric variables | ||
validate_numeric_variable() { | ||
local var_name="$1" | ||
local var_value="$2" | ||
|
||
if [[ "${var_value}" =~ ^[0-9]+$ ]]; then | ||
echo "${var_name} is valid: ${var_value}" | ||
else | ||
echo "${var_name} is invalid: ${var_value}" | ||
exit 1 | ||
fi | ||
} | ||
|
||
# Delay for start of profile capture to avoid profiling unintended setup steps | ||
NSYS_PROFILE_DELAY=${NSYS_PROFILE_DELAY:-30} | ||
# Security Validation | ||
validate_numeric_variable "NSYS_PROFILE_DELAY" "${NSYS_PROFILE_DELAY}" | ||
|
||
# Duration for profile capture to avoid diluting the profile. | ||
NSYS_PROFILE_DURATION=${NSYS_PROFILE_DURATION:-600} | ||
# Security Validation | ||
validate_numeric_variable "NSYS_PROFILE_DURATION" "${NSYS_PROFILE_DURATION}" | ||
|
||
# APIs and operations to trace during the capture. | ||
NSYS_PROFILE_TRACE=${NSYS_PROFILE_TRACE:-"cuda,nvtx,osrt,cudnn,cublas,mpi,python-gil"} | ||
# Security Validation | ||
if [[ "$NSYS_PROFILE_TRACE" =~ ^[a-z0-9,-]+$ ]]; then | ||
echo "NSYS_PROFILE_TRACE is valid: ${NSYS_PROFILE_TRACE}" | ||
else | ||
echo "NSYS_PROFILE_TRACE is invalid: ${NSYS_PROFILE_TRACE}" | ||
echo "Only lowercase letters, numbers, commas, and hyphens are allowed." | ||
exit 1 | ||
fi | ||
|
||
if [ -n "${S3_DEBUG_PATH}" ]; then | ||
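# Validate the S3 path format. The hyphen is placed last in each bracket | ||
# expression so that it is matched literally rather than read as a range. | ||
if [[ ! "$S3_DEBUG_PATH" =~ ^s3://[a-z0-9.-]+(/[a-zA-Z0-9._-]+)*/$ ]]; then | ||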
echo "Error: S3_DEBUG_PATH must be of the format s3://bucket/key/" | ||
exit 1 | ||
fi | ||
fi | ||
|
||
nsys profile \ | ||
--kill=sigkill \ | ||
--wait=primary \ | ||
--show-output true \ | ||
--osrt-threshold 10000 \ | ||
--delay "${NSYS_PROFILE_DELAY}" \ | ||
--duration "${NSYS_PROFILE_DURATION}" \ | ||
--python-backtrace=cuda \ | ||
--trace "${NSYS_PROFILE_TRACE}" \ | ||
--cudabacktrace all:10000 \ | ||
--output "$(hostname).nsys-rep" \ | ||
-- djl-serving "$@" || true # Nsys exits with non-zero code when the application is terminated due to a timeout which is expected | ||
|
||
if [ -n "${S3_DEBUG_PATH}" ]; then | ||
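# nsys writes the report to the working directory, so this assumes the script runs from /opt/djl | ||
s5cmd cp /opt/djl/*.nsys-rep "$S3_DEBUG_PATH" | ||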
fi |
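
The upload step relies on a wildcard matching at least one report. If profiling ends before a report is written, the wildcard matches nothing. As a minimal sketch of a guard for that case, the hypothetical helper `list_reports` below lists the reports in a directory and prints nothing when there are none, so a caller could skip the upload. It is an illustration under that assumption, not part of the commit, and it does not call `s5cmd`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): print every .nsys-rep file in a
# directory, one per line; print nothing if there are none.
list_reports() {
  local dir="$1"
  local report
  for report in "${dir}"/*.nsys-rep; do
    # When nothing matches, the loop sees the unexpanded pattern; skip it.
    [ -e "${report}" ] && echo "${report}"
  done
  return 0
}

# Demonstration in a throwaway directory with an empty stand-in report file.
demo_dir="$(mktemp -d)"
touch "${demo_dir}/example-host.nsys-rep"
list_reports "${demo_dir}"
rm -rf "${demo_dir}"
```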