Ll7/issue57-Slurm-multithreading-setting-might-be-slow #58
Conversation
…rmation

The SLURM user_queue.sh script has been updated to include the "-l" flag when querying the job queue. This provides more detailed information about the jobs owned by the user. This change improves the usability of the script by providing additional context for each job in the queue.
A new script, `partition_queue.sh`, has been added to the SLURM directory. This script allows users to query the job queue for the `epyc-gpu-test` partition. It provides a convenient way to check the status of jobs in this specific partition. This addition enhances the functionality of the SLURM scripts and improves the user experience.
Walkthrough

The changes introduce several updates, including the addition of a new extension for TensorBoard support in Visual Studio Code and the creation of a Python script.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant SLURM
    participant TrainingScript
    participant TensorBoard
    User->>SLURM: Submit job with slurm_train.sl
    SLURM->>TrainingScript: Execute log_gpu_cpu_usage.py
    TrainingScript->>TensorBoard: Log CPU and GPU usage
    TrainingScript->>SLURM: Report training progress
    TensorBoard->>User: Display resource usage metrics
```
Actionable comments posted: 9
Outside diff range and nitpick comments (6)
SLURM/user_queue.sh (1)
3-3: Approved: Enhanced job queue display with more detailed output.

The addition of the `-l` flag to the `squeue` command is a good improvement. It will provide more comprehensive information about the user's jobs, which can be beneficial for debugging and monitoring purposes.

Consider making the output format configurable to allow for both concise and detailed views. You could achieve this by using a shell variable or a command-line argument. Here's an example implementation:

```bash
#!/bin/bash
# Default to short format
FORMAT=""
# Check if -l argument is provided
if [[ "$1" == "-l" ]]; then
  FORMAT="-l"
fi
squeue -u $USER $FORMAT
```

This way, users can choose between the default short format and the detailed long format by running either `./user_queue.sh` or `./user_queue.sh -l`.

SLURM/partition_queue.sh (1)
1-3: Consider additional features and clarify the script's role in the project.

While this script provides basic functionality for querying SLURM jobs, there are a few areas where it could be expanded to better integrate with the larger project:

- Logging: Consider adding logging capabilities to track when and how often this script is called, and what results it returns.
- Callback Integration: The PR description mentions a callback mechanism for waiting for jobs in the queue. It's not clear how this script integrates with that mechanism. Could you clarify how this script fits into the larger workflow?
- GPU Usage Connection: The AI summary mentions GPU usage monitoring. Is there a way this script could incorporate or interact with GPU usage information?
- Error Handling: Consider adding more robust error handling and reporting, especially if this script is part of an automated process.
Here's a suggestion for adding basic logging:
```diff
 #! /bin/bash
+# Set up logging
+LOG_FILE="/path/to/partition_queue.log"
+log() {
+    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
+}
+
+log "Script started with partition: $PARTITION"
+
 squeue -p "$PARTITION" -l
+
+log "Script completed"
```

Could you provide more context on how this script fits into the overall architecture of your SLURM job management system? This would help in providing more targeted advice for improvements and integration.
.vscode/settings.json (1)
2-2: LGTM! Consider adding a prefix for consistency.

The simplified branch naming convention looks good. Removing the `${user}` variable makes branch names more consistent across different contributors. This change aligns well with common practices in many projects.

For even better consistency, consider adding a prefix to clearly identify issue-based branches. For example:

```json
"githubIssues.issueBranchTitle": "issue-${issueNumber}-${sanitizedIssueTitle}"
```

This format makes it immediately clear that the branch is related to an issue and maintains a consistent structure.
SLURM/slurm_train.sl (2)
Line range hint 24-28: Consider uncommenting OMP_NUM_THREADS setting.

The commented section about setting OMP_NUM_THREADS might be relevant to the performance issues mentioned in the PR objectives.

Consider uncommenting and setting `OMP_NUM_THREADS=1` if you're implementing your own parallelization in the `log_gpu_cpu_usage.py` script. This can prevent the creation of too many threads and potential slowdowns. If you're not sure about the parallelization in the new script, it might be worth investigating this aspect as part of the performance analysis.
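If that thread cap ends up living inside the Python entry point rather than the batch script, a minimal sketch (assuming the heavy numerical imports happen after the variable is set, and that torch is the backend in use) might look like:

```python
import os

# Cap OpenMP/BLAS worker threads before torch spins up its thread pools,
# so they don't multiply with the vectorized environments' own parallelism.
os.environ.setdefault("OMP_NUM_THREADS", "1")

import torch  # noqa: E402  (imported after the env var on purpose)

torch.set_num_threads(1)  # also cap torch's intra-op threads explicitly
print("OMP_NUM_THREADS =", os.environ["OMP_NUM_THREADS"])
```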
Line range hint 1-30: Summary and next steps for performance investigation

The changes to this script align with the PR objectives of investigating performance issues, particularly the Slurm multithreading settings. However, some modifications might impact the accuracy of performance measurements in a production-like environment.
To ensure a comprehensive performance investigation:
- Clarify the role of `log_gpu_cpu_usage.py` in the overall workflow.
- Review resource allocations to ensure they match both the logging needs and represent real-world usage.
- Consider running tests on both the test and production partitions to compare performance.
- Ensure the job time limit allows for gathering sufficient data.
- Investigate the impact of OMP_NUM_THREADS settings on performance.
After gathering data with this script, consider creating a follow-up PR that implements any necessary changes to improve performance based on the findings.
SLURM/log_gpu_cpu_usage.py (1)
77-79: Log Additional Environment Information for Reproducibility

Consider logging additional environment details such as the versions of `psutil`, `GPUtil`, and `stable_baselines3`. This information can be helpful for debugging and reproducing results.

For example, add:
```python
logger.info(f"psutil version: {psutil.__version__}")
logger.info(f"GPUtil version: {GPUtil.__version__}")
logger.info(f"Stable Baselines3 version: {stable_baselines3.__version__}")
```
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- .vscode/settings.json (1 hunks)
- SLURM/log_gpu_cpu_usage.py (1 hunks)
- SLURM/partition_queue.sh (1 hunks)
- SLURM/readme.md (1 hunks)
- SLURM/slurm_train.sl (2 hunks)
- SLURM/user_queue.sh (1 hunks)
Additional comments not posted (7)
SLURM/readme.md (1)
9-13: Verify the conda environment name

The instructions for activating the conda environment are clear. However, the environment name 'conda_env' seems generic. Please verify if this is the correct name for the project's environment and update it if necessary.
Verification successful
Conda environment name is correctly specified as 'conda_env'.
Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check if there's a specific conda environment name mentioned in the project files

# Search for conda environment names in relevant files
rg -i 'conda (?:create|activate).*env.*' --type yaml --type bash --type python
```

Length of output: 109
Script:
```bash
#!/bin/bash
# Description: Corrected search for conda environment names in relevant files

# Search for conda environment names in shell scripts, YAML, and Python files
rg -i 'conda (create|activate) .*env.*' --type sh --type yaml --type py
```

Length of output: 119
SLURM/slurm_train.sl (3)
4-4: Verify the partition change aligns with PR objectives.

The partition has been changed from 'epyc-gpu' to 'epyc-gpu-test'. While this might be intentional for testing purposes, it's important to ensure this aligns with the PR objectives and won't negatively impact the performance investigation.
Could you confirm if this partition change is temporary for testing, or if it's intended to be a permanent change? Also, please verify that the 'epyc-gpu-test' partition has the necessary resources for the performance investigation.
30-30: Script change aligns with PR objectives, but clarification needed.

The change from `slurm_PPO_robot_sf.py` to `log_gpu_cpu_usage.py` aligns with the PR objectives of addressing performance issues. However, it's not clear if this is meant to replace the original training script or if it's a temporary change for debugging.

Could you please clarify:

- Is this change temporary for debugging purposes, or is it intended to be a permanent replacement?
- If temporary, how will the original training script be integrated with this logging functionality?
- If permanent, does `log_gpu_cpu_usage.py` include the training functionality, or is it purely for logging?

Additionally, it would be helpful to see the contents of `log_gpu_cpu_usage.py` to understand its functionality better. Could you provide this file for review?
Line range hint 10-14: Review resource allocation for the new script.

The script still requests 64 CPUs and 1 A100 GPU, which might be excessive for a logging script.

Please confirm if `log_gpu_cpu_usage.py` requires this level of resources. If not, consider adjusting the resource requests to match the script's actual needs. This could help improve cluster utilization and job scheduling efficiency.

SLURM/log_gpu_cpu_usage.py (3)
54-58: Review Checkpoint Saving Frequency

The `save_model_callback` is configured to save checkpoints every `500_000 // n_envs` steps, which equals 7,812 steps with `n_envs = 64`. This might result in frequent disk writes and storage consumption.

Ensure that this frequency aligns with your resource constraints and training needs. If necessary, adjust the checkpointing interval.
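For reference, a minimal sketch of how such a per-environment save frequency is typically wired into Stable Baselines3's `CheckpointCallback` (the save path and name prefix below are hypothetical, not taken from the script):

```python
from stable_baselines3.common.callbacks import CheckpointCallback

n_envs = 64                    # matches the vectorized setup discussed above
total_steps_per_save = 500_000

# CheckpointCallback counts one call per rollout step across all vectorized
# environments, so dividing by n_envs keeps roughly 500k *total* environment
# steps between checkpoints (7_812 calls with 64 envs).
save_model_callback = CheckpointCallback(
    save_freq=total_steps_per_save // n_envs,
    save_path="./model/checkpoints",  # hypothetical path
    name_prefix="ppo_robot_sf",       # hypothetical prefix
)
```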
49-53: Specify the `policy_kwargs` Parameter Correctly

The `policy_kwargs` is defined but may not include all necessary parameters for the `DynamicsExtractor`. Ensure that any required parameters for the custom feature extractor are passed correctly.

Check the `DynamicsExtractor` class to confirm if additional arguments are needed.
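For comparison, this is roughly how Stable Baselines3 expects a custom extractor and its constructor arguments to be threaded through `policy_kwargs`; the built-in `FlattenExtractor` and the CartPole environment below are only stand-ins for the project's `DynamicsExtractor` and robot environment:

```python
import gymnasium as gym  # gymnasium assumed (SB3 >= 2.0); older SB3 uses gym
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import FlattenExtractor

# FlattenExtractor stands in for DynamicsExtractor; the relevant part is how
# features_extractor_class / features_extractor_kwargs are passed through.
policy_kwargs = dict(
    features_extractor_class=FlattenExtractor,
    features_extractor_kwargs=dict(),  # constructor args the extractor needs, if any
)

env = gym.make("CartPole-v1")  # placeholder env, not the robot_sf environment
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=0)
```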
47-53: Verify Compatibility of `progress_bar` Parameter with Stable Baselines3 Version

The usage of `progress_bar=True` in `model.learn()` requires Stable Baselines3 version 1.6.0 or higher. Ensure that the environment has the correct version installed to avoid potential errors.

You can check the installed version of Stable Baselines3 using the following script:
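A minimal check along these lines, run inside the training conda environment, covers that (the version parsing here is a simplified sketch):

```python
import stable_baselines3

version = stable_baselines3.__version__
print(f"Stable Baselines3 version: {version}")

# progress_bar=True in model.learn() needs >= 1.6.0 (plus the tqdm/rich extras).
major, minor = (int(part) for part in version.split(".")[:2])
assert (major, minor) >= (1, 6), "progress_bar=True requires Stable Baselines3 >= 1.6.0"
```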
… partition name

The SLURM partition_queue.sh script has been updated to allow overriding the default partition name via a command line argument. This enhancement provides flexibility for users to query job queues in different partitions without modifying the script. It improves the usability of the script and enhances the user experience.
Now waiting for the simulation results.
A new script, `set_email_env_variable.sh`, has been added to the SLURM directory. This script sets the email as an environment variable in the `.bashrc` file if it is not already set. It provides a convenient way for users to ensure that the SLURM_EMAIL variable is properly configured. This addition improves the usability of the SLURM scripts and enhances the user experience.
The `slurm_train.sl` script has been updated to include a check for the `SLURM_EMAIL` environment variable. If the variable is not set, a message is displayed prompting the user to set it before running the script. Additionally, email notification options have been added to the script using the `SBATCH --mail-user` and `SBATCH --mail-type` directives. This change improves the usability of the script by ensuring that the user is notified of job completion or failure via email.
… and set OMP_NUM_THREADS=1
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
SLURM/slurm_train.sl (2)
14-22: Approve: Good implementation of optional email notifications.

The addition of conditional email notifications based on the SLURM_EMAIL environment variable is a good improvement. It enhances flexibility and follows best practices.
Consider adding a comment explaining how to set the SLURM_EMAIL variable before running the script, for better user guidance. For example:
```bash
# Add this comment at the beginning of the script
# To enable email notifications, set the SLURM_EMAIL environment variable before running:
# export SLURM_EMAIL="[email protected]"
```
29-29: Approve: Good practice to set OMP_NUM_THREADS=1.

Setting `OMP_NUM_THREADS=1` is a good practice to prevent the creation of too many threads, which can lead to performance issues. This aligns well with the PR objective of addressing Slurm multithreading performance.

Consider expanding the comment on line 29 to provide more context. For example:
```bash
# Set OMP_NUM_THREADS=1 to prevent conflicts with Stable-Baselines3's own parallelization.
# This can help avoid oversubscription and potential slowdowns in multi-threaded environments.
```

This more detailed comment would help future maintainers understand the reasoning behind this setting.
Also applies to: 45-45
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- SLURM/slurm_train.sl (2 hunks)
🔇 Additional comments not posted (3)
SLURM/slurm_train.sl (3)
4-5: Verify the partition change and job time reduction.

The partition has been changed from 'epyc-gpu' to 'epyc-gpu-test'. Please confirm if this change is intentional and appropriate for the current stage of development.

Regarding the job time reduction from 10 hours to 2 hours, please refer to the existing comment about this significant change. Ensure that 2 hours is sufficient for the `log_gpu_cpu_usage.py` script to gather all necessary data.
48-48: Clarify the change in the executed script.

The primary script executed by `srun` has been changed from `slurm_PPO_robot_sf.py` to `log_gpu_cpu_usage.py`. This appears to be related to the PR objective of addressing Slurm multithreading performance issues.

Please clarify:
- Is this a temporary change for debugging purposes?
- If temporary, how will the original training script be integrated with this logging functionality?
- If permanent, how does this affect the overall workflow of the project?
Line range hint 1-51: Summary: Good progress on addressing performance issues, some clarifications needed.

Overall, the changes to this SLURM script align well with the PR objectives of addressing Slurm multithreading performance issues. The modifications to the partition, job time, email notifications, and OMP_NUM_THREADS setting are appropriate.
However, there are a few points that require clarification or additional documentation:
- The change in the executed script from training to logging.
- The intention behind the partition change.
- The sufficiency of the reduced job time.
Once these points are addressed and the suggested documentation improvements are made, this script will be in good shape to support the performance optimization efforts.
…b queue start time
…ser filter for querying job queue start time
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- SLURM/partition_queue_start_time.sh (1 hunks)
🔇 Additional comments not posted (1)
SLURM/partition_queue_start_time.sh (1)
13-17: Good practice: Checking for required command.

The check for the existence of the `squeue` command is a good practice. It ensures that the script fails early with a clear error message if the required command is not available.
SLURM/partition_queue_start_time.sh (outdated)

```bash
#! /bin/bash

# Script to query jobs in a specified SLURM partition

# Default partition name
PARTITION="epyc-gpu-test"

# Allow overriding the partition name via command line argument
if [ $# -eq 1 ]; then
    PARTITION="$1"
fi
```
🛠️ Refactor suggestion
Consider enhancing script flexibility with command-line options.
The script effectively sets a default partition and allows overriding it. However, to improve flexibility and usability, consider implementing more robust command-line option handling.
You could use `getopts` to handle multiple options, such as:

- `-p` or `--partition` to specify the partition
- `-h` or `--help` to display usage information
Here's an example implementation:
```bash
#!/bin/bash

# Default partition name
PARTITION="epyc-gpu-test"

# Function to display usage information
usage() {
    echo "Usage: $0 [-p PARTITION]"
    echo "  -p PARTITION  Specify the SLURM partition (default: epyc-gpu-test)"
    echo "  -h            Display this help message"
}

# Parse command-line options
while getopts ":p:h" opt; do
    case ${opt} in
        p )
            PARTITION=$OPTARG
            ;;
        h )
            usage
            exit 0
            ;;
        \? )
            echo "Invalid option: $OPTARG" 1>&2
            usage
            exit 1
            ;;
        : )
            echo "Invalid option: $OPTARG requires an argument" 1>&2
            usage
            exit 1
            ;;
    esac
done
```
This approach provides better user guidance and allows for future expansion of command-line options.
…e57-Slurm-multithreading-setting-might-be-slow
…ser filter for querying job queue start time
@coderabbitai configuration
CodeRabbit configuration

```yaml
language: en-US
tone_instructions: ''
early_access: false
enable_free_tier: true
reviews:
profile: chill
request_changes_workflow: false
high_level_summary: true
high_level_summary_placeholder: '@coderabbitai summary'
auto_title_placeholder: '@coderabbitai'
review_status: true
poem: true
collapse_walkthrough: false
sequence_diagrams: true
changed_files_summary: true
labeling_instructions: []
path_filters: []
path_instructions: []
abort_on_close: true
auto_review:
enabled: true
auto_incremental_review: true
ignore_title_keywords: []
labels: []
drafts: false
base_branches: []
tools:
shellcheck:
enabled: true
ruff:
enabled: true
markdownlint:
enabled: true
github-checks:
enabled: true
timeout_ms: 90000
languagetool:
enabled: true
enabled_only: false
level: default
biome:
enabled: true
hadolint:
enabled: true
swiftlint:
enabled: true
phpstan:
enabled: true
level: default
golangci-lint:
enabled: true
yamllint:
enabled: true
gitleaks:
enabled: true
checkov:
enabled: true
detekt:
enabled: true
eslint:
enabled: true
rubocop:
enabled: true
buf:
enabled: true
regal:
enabled: true
actionlint:
enabled: true
pmd:
enabled: true
cppcheck:
enabled: true
semgrep:
enabled: true
circleci:
enabled: true
chat:
auto_reply: true
knowledge_base:
opt_out: false
learnings:
scope: auto
issues:
scope: auto
jira:
project_keys: []
linear:
team_keys: []
pull_requests:
scope: auto
```
This commit updates the `log_gpu_cpu_usage.py` script in the SLURM directory to include logging of GPU memory utilization in addition to CPU and GPU usage. The `LogResourceUsageCallback` class now records the memory utilization of each GPU in the `gpu_{idx}_memory_util` field. This enhancement provides more comprehensive monitoring of resource usage during training.
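A rough sketch of what such a callback usually looks like with `psutil`, `GPUtil`, and the Stable Baselines3 callback API (class and tag names here mirror the description above rather than the actual file contents):

```python
import GPUtil
import psutil
from stable_baselines3.common.callbacks import BaseCallback


class LogResourceUsageCallback(BaseCallback):
    """Record CPU load plus per-GPU load and memory utilization on each step."""

    def _on_step(self) -> bool:
        self.logger.record("resources/cpu_percent", psutil.cpu_percent())
        for idx, gpu in enumerate(GPUtil.getGPUs()):
            self.logger.record(f"gpu_{idx}_load", gpu.load * 100)
            self.logger.record(f"gpu_{idx}_memory_util", gpu.memoryUtil * 100)
        return True
```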
add possible callback, waiting for job in queue
Summary by CodeRabbit
New Features
Bug Fixes