
Ll7/issue57-Slurm-multithreading-setting-might-be-slow #58

Merged: 19 commits into main, Sep 26, 2024

Conversation

@ll7 (Owner) commented Sep 24, 2024

Adds a possible callback; currently waiting for the job in the queue.

Summary by CodeRabbit

  • New Features

    • Added TensorBoard support to enhance the Visual Studio Code environment.
    • Introduced a training script for monitoring CPU and GPU usage during robot training.
    • Added new shell scripts for querying job details in the SLURM workload manager and for setting email notifications.
    • Updated instructions in the README for tracking resource usage during training.
    • Modified SLURM job script to run the new resource logging script.
    • Enhanced user job queue display with more detailed information.
  • Bug Fixes

    • Improved output format for user job queue in SLURM.

ll7 added 6 commits September 24, 2024 16:59
…rmation

The SLURM user_queue.sh script has been updated to include the "-l" flag when querying the job queue. This provides more detailed information about the jobs owned by the user. This change improves the usability of the script by providing additional context for each job in the queue.
A new script, `partition_queue.sh`, has been added to the SLURM directory. This script allows users to query the job queue for the `epyc-gpu-test` partition. It provides a convenient way to check the status of jobs in this specific partition. This addition enhances the functionality of the SLURM scripts and improves the user experience.
coderabbitai bot (Contributor) commented Sep 24, 2024

Walkthrough

The changes introduce several updates, including the addition of a new extension for TensorBoard support in Visual Studio Code, the creation of a Python script log_gpu_cpu_usage.py for tracking CPU and GPU usage during training on a SLURM server, and new shell scripts for querying job status and setting email notifications. Additionally, adjustments were made to the SLURM job script slurm_train.sl, and documentation was updated in readme.md to reflect these changes.
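For orientation, a callback of this kind can be built on Stable Baselines3's BaseCallback together with psutil and GPUtil; the class and metric names below are an illustrative sketch, not necessarily the exact ones used in log_gpu_cpu_usage.py.

import psutil
import GPUtil
from stable_baselines3.common.callbacks import BaseCallback

class ResourceUsageCallback(BaseCallback):
    """Record CPU and GPU utilization to the model's TensorBoard logger."""

    def _on_step(self) -> bool:
        # System-wide CPU utilization in percent
        self.logger.record("resources/cpu_percent", psutil.cpu_percent())
        # Per-GPU load and memory utilization, both reported in the range 0.0-1.0
        for idx, gpu in enumerate(GPUtil.getGPUs()):
            self.logger.record(f"resources/gpu_{idx}_util", gpu.load)
            self.logger.record(f"resources/gpu_{idx}_memory_util", gpu.memoryUtil)
        return True  # returning False would abort training

Passing an instance of such a callback to model.learn(callback=...) makes the recorded scalars appear alongside the usual training metrics in TensorBoard.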

Changes

  • .vscode/extensions.json: Added the extension identifier "ms-toolsai.tensorboard" to support TensorBoard in Visual Studio Code.
  • SLURM/log_gpu_cpu_usage.py: Introduced a new script for training a robot, logging CPU and GPU metrics to TensorBoard via a custom callback and a PPO model setup.
  • SLURM/partition_queue.sh: Created a new shell script that runs the squeue command, querying jobs in the epyc-gpu-test partition with detailed output.
  • SLURM/partition_queue_start_time.sh: New script to query jobs in a specified SLURM partition, capturing the current user's job status.
  • SLURM/readme.md: Added instructions for tracking CPU and GPU usage during training, including package installation and job submission steps.
  • SLURM/slurm_train.sl: Changed the partition to epyc-gpu-test, reduced the job time from 10:00:00 to 2:00:00, and updated the command to run log_gpu_cpu_usage.py.
  • SLURM/user_queue.sh: Updated the command to squeue -u $USER -l for a more detailed view of the user's job queue.
  • SLURM/set_email_env_variable.sh: Introduced a new script that sets the SLURM_EMAIL environment variable in the user's .bashrc file.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SLURM
    participant TrainingScript
    participant TensorBoard

    User->>SLURM: Submit job with slurm_train.sl
    SLURM->>TrainingScript: Execute log_gpu_cpu_usage.py
    TrainingScript->>TensorBoard: Log CPU and GPU usage
    TrainingScript->>SLURM: Report training progress
    TensorBoard->>User: Display resource usage metrics

🐇 In the code, a change so bright,
Branch titles now are light!
With logs of CPU and GPU,
Training's smooth, and metrics too!
Hopping through the SLURM queue,
A joyful script for me and you! 🐇


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between cb14077 and ac407b7.

📒 Files selected for processing (2)
  • .vscode/extensions.json (1 hunks)
  • SLURM/log_gpu_cpu_usage.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • SLURM/log_gpu_cpu_usage.py
🔇 Additional comments (1)
.vscode/extensions.json (1)

6-7: Approved: Addition of TensorBoard extension

The inclusion of the TensorBoard extension ("ms-toolsai.tensorboard") is a valuable addition to the project's VS Code setup. This aligns well with the PR's focus on addressing performance issues and Slurm multithreading settings.

TensorBoard can be particularly useful for visualizing and analyzing performance metrics, which could aid in identifying and resolving the slow performance mentioned in the PR objectives.
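For illustration, Stable Baselines3 writes its training scalars to TensorBoard when a log directory is passed to the model; the environment id and directory in this sketch are placeholders.

from stable_baselines3 import PPO

# Scalars recorded by callbacks via self.logger.record() land in the same run directory.
model = PPO("MlpPolicy", "CartPole-v1", tensorboard_log="./logs/tb")
model.learn(total_timesteps=10_000)
# Inspect afterwards with: tensorboard --logdir ./logs/tb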



coderabbitai bot (Contributor) left a comment

Actionable comments posted: 9

Outside diff range and nitpick comments (6)
SLURM/user_queue.sh (1)

3-3: Approved: Enhanced job queue display with more detailed output.

The addition of the -l flag to the squeue command is a good improvement. It will provide more comprehensive information about the user's jobs, which can be beneficial for debugging and monitoring purposes.

Consider making the output format configurable to allow for both concise and detailed views. You could achieve this by using a shell variable or a command-line argument. Here's an example implementation:

#!/bin/bash

# Default to short format
FORMAT=""

# Check if -l argument is provided
if [[ "$1" == "-l" ]]; then
    FORMAT="-l"
fi

squeue -u $USER $FORMAT

This way, users can choose between the default short format and the detailed long format by running either ./user_queue.sh or ./user_queue.sh -l.

SLURM/partition_queue.sh (1)

1-3: Consider additional features and clarify the script's role in the project.

While this script provides basic functionality for querying SLURM jobs, there are a few areas where it could be expanded to better integrate with the larger project:

  1. Logging: Consider adding logging capabilities to track when and how often this script is called, and what results it returns.

  2. Callback Integration: The PR description mentions a callback mechanism for waiting for jobs in the queue. It's not clear how this script integrates with that mechanism. Could you clarify how this script fits into the larger workflow?

  3. GPU Usage Connection: The AI summary mentions GPU usage monitoring. Is there a way this script could incorporate or interact with GPU usage information?

  4. Error Handling: Consider adding more robust error handling and reporting, especially if this script is part of an automated process.

Here's a suggestion for adding basic logging:

 #! /bin/bash

+# Set up logging
+LOG_FILE="/path/to/partition_queue.log"
+log() {
+    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
+}
+
+log "Script started with partition: $PARTITION"
+
 squeue -p "$PARTITION" -l
+
+log "Script completed"

Could you provide more context on how this script fits into the overall architecture of your SLURM job management system? This would help in providing more targeted advice for improvements and integration.

.vscode/settings.json (1)

2-2: LGTM! Consider adding a prefix for consistency.

The simplified branch naming convention looks good. Removing the ${user} variable makes branch names more consistent across different contributors. This change aligns well with common practices in many projects.

For even better consistency, consider adding a prefix to clearly identify issue-based branches. For example:

"githubIssues.issueBranchTitle": "issue-${issueNumber}-${sanitizedIssueTitle}"

This format makes it immediately clear that the branch is related to an issue and maintains a consistent structure.

SLURM/slurm_train.sl (2)

Line range hint 24-28: Consider uncommenting OMP_NUM_THREADS setting.

The commented section about setting OMP_NUM_THREADS might be relevant to the performance issues mentioned in the PR objectives.

Consider uncommenting and setting OMP_NUM_THREADS=1 if you're implementing your own parallelization in the log_gpu_cpu_usage.py script. This can prevent the creation of too many threads and potential slowdowns. If you're not sure about the parallelization in the new script, it might be worth investigating this aspect as part of the performance analysis.
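If the limit is applied from inside the Python script rather than the batch file, a minimal sketch (assuming PyTorch is the compute backend) looks like this; the environment variable must be set before numpy or torch is imported for OpenMP to honour it.

import os
os.environ.setdefault("OMP_NUM_THREADS", "1")  # must happen before numpy/torch imports

import torch
torch.set_num_threads(1)  # additionally cap PyTorch's intra-op thread pool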


Line range hint 1-30: Summary and next steps for performance investigation

The changes to this script align with the PR objectives of investigating performance issues, particularly the Slurm multithreading settings. However, some modifications might impact the accuracy of performance measurements in a production-like environment.

To ensure a comprehensive performance investigation:

  1. Clarify the role of log_gpu_cpu_usage.py in the overall workflow.
  2. Review resource allocations to ensure they match both the logging needs and represent real-world usage.
  3. Consider running tests on both the test and production partitions to compare performance.
  4. Ensure the job time limit allows for gathering sufficient data.
  5. Investigate the impact of OMP_NUM_THREADS settings on performance.

After gathering data with this script, consider creating a follow-up PR that implements any necessary changes to improve performance based on the findings.

SLURM/log_gpu_cpu_usage.py (1)

77-79: Log Additional Environment Information for Reproducibility

Consider logging additional environment details such as the versions of psutil, GPUtil, and stable_baselines3. This information can be helpful for debugging and reproducing results.

For example, add:

logger.info(f"psutil version: {psutil.__version__}")
logger.info(f"GPUtil version: {GPUtil.__version__}")
logger.info(f"Stable Baselines3 version: {stable_baselines3.__version__}")
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between fa11313 and f9efdad.

Files selected for processing (6)
  • .vscode/settings.json (1 hunks)
  • SLURM/log_gpu_cpu_usage.py (1 hunks)
  • SLURM/partition_queue.sh (1 hunks)
  • SLURM/readme.md (1 hunks)
  • SLURM/slurm_train.sl (2 hunks)
  • SLURM/user_queue.sh (1 hunks)
Additional comments not posted (7)
SLURM/readme.md (1)

9-13: Verify the conda environment name

The instructions for activating the conda environment are clear. However, the environment name 'conda_env' seems generic. Please verify if this is the correct name for the project's environment and update it if necessary.

Verification successful

Conda environment name is correctly specified as 'conda_env'.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if there's a specific conda environment name mentioned in the project files

# Search for conda environment names in relevant files
rg -i 'conda (?:create|activate).*env.*' --type yaml --type bash --type python

Length of output: 109


Script:

#!/bin/bash
# Description: Corrected search for conda environment names in relevant files

# Search for conda environment names in shell scripts, YAML, and Python files
rg -i 'conda (create|activate) .*env.*' --type sh --type yaml --type py

Length of output: 119

SLURM/slurm_train.sl (3)

4-4: Verify the partition change aligns with PR objectives.

The partition has been changed from 'epyc-gpu' to 'epyc-gpu-test'. While this might be intentional for testing purposes, it's important to ensure this aligns with the PR objectives and won't negatively impact the performance investigation.

Could you confirm if this partition change is temporary for testing, or if it's intended to be a permanent change? Also, please verify that the 'epyc-gpu-test' partition has the necessary resources for the performance investigation.


30-30: Script change aligns with PR objectives, but clarification needed.

The change from slurm_PPO_robot_sf.py to log_gpu_cpu_usage.py aligns with the PR objectives of addressing performance issues. However, it's not clear if this is meant to replace the original training script or if it's a temporary change for debugging.

Could you please clarify:

  1. Is this change temporary for debugging purposes, or is it intended to be a permanent replacement?
  2. If temporary, how will the original training script be integrated with this logging functionality?
  3. If permanent, does log_gpu_cpu_usage.py include the training functionality, or is it purely for logging?

Additionally, it would be helpful to see the contents of log_gpu_cpu_usage.py to understand its functionality better. Could you provide this file for review?


Line range hint 10-14: Review resource allocation for the new script.

The script still requests 64 CPUs and 1 A100 GPU, which might be excessive for a logging script.

Please confirm if log_gpu_cpu_usage.py requires this level of resources. If not, consider adjusting the resource requests to match the script's actual needs. This could help improve cluster utilization and job scheduling efficiency.
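One way to keep the Python side consistent with whatever the batch script requests is to size the vectorized environments from SLURM's exported allocation; note that SLURM_CPUS_PER_TASK is only exported when --cpus-per-task is set, so the fallback value here is an assumption.

import os

# Use one environment per allocated CPU, falling back to a single env outside SLURM.
n_envs = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))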

SLURM/log_gpu_cpu_usage.py (3)

54-58: Review Checkpoint Saving Frequency

The save_model_callback is configured to save checkpoints every 500_000 // n_envs steps, which equals 7,812 steps with n_envs = 64. This might result in frequent disk writes and storage consumption.

Ensure that this frequency aligns with your resource constraints and training needs. If necessary, adjust the checkpointing interval.
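For reference, the usual Stable Baselines3 pattern is a CheckpointCallback whose save_freq counts callback invocations (each invocation covers one step in every vectorized environment), which is why the interval is divided by the number of environments; the path and prefix below are placeholders.

from stable_baselines3.common.callbacks import CheckpointCallback

n_envs = 64
save_model_callback = CheckpointCallback(
    save_freq=max(500_000 // n_envs, 1),  # roughly every 500k total environment steps
    save_path="./checkpoints",            # placeholder directory
    name_prefix="ppo_robot_sf",           # placeholder prefix
)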


49-53: Specify the policy_kwargs Parameter Correctly

The policy_kwargs is defined but may not include all necessary parameters for the DynamicsExtractor. Ensure that any required parameters for the custom feature extractor are passed correctly.

Check the DynamicsExtractor class to confirm if additional arguments are needed.
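For context, a custom feature extractor is normally wired in through policy_kwargs; DynamicsExtractor is the project's own class, and the import path and keyword arguments shown here are hypothetical.

from robot_sf.feature_extractor import DynamicsExtractor  # hypothetical import path

policy_kwargs = dict(
    features_extractor_class=DynamicsExtractor,
    # features_extractor_kwargs=dict(...),  # any constructor arguments the extractor expects
)
# passed to the model as: PPO("MultiInputPolicy", env, policy_kwargs=policy_kwargs)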


47-53: Verify Compatibility of progress_bar Parameter with Stable Baselines3 Version

The usage of progress_bar=True in model.learn() requires Stable Baselines3 version 1.6.0 or higher. Ensure that the environment has the correct version installed to avoid potential errors.

You can check the installed version of Stable Baselines3 using the following script:

Resolved review threads:
  • SLURM/partition_queue.sh (2, outdated)
  • SLURM/readme.md (4)
  • SLURM/slurm_train.sl (1)
  • SLURM/log_gpu_cpu_usage.py (2, outdated)
ll7 added 3 commits September 25, 2024 09:10
… partition name

The SLURM partition_queue.sh script has been updated to allow overriding the default partition name via a command line argument. This enhancement provides flexibility for users to query job queues in different partitions without modifying the script. It improves the usability of the script and enhances the user experience.
@ll7 (Owner, Author) commented Sep 25, 2024

Now waiting for the simulation results.

ll7 added 3 commits September 25, 2024 10:03
A new script, `set_email_env_variable.sh`, has been added to the SLURM directory. This script sets the email as an environment variable in the `.bashrc` file if it is not already set. It provides a convenient way for users to ensure that the SLURM_EMAIL variable is properly configured. This addition improves the usability of the SLURM scripts and enhances the user experience.
The `slurm_train.sl` script has been updated to include a check for the `SLURM_EMAIL` environment variable. If the variable is not set, a message is displayed prompting the user to set it before running the script. Additionally, email notification options have been added to the script using the `SBATCH --mail-user` and `SBATCH --mail-type` directives. This change improves the usability of the script by ensuring that the user is notified of job completion or failure via email.
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
SLURM/slurm_train.sl (2)

14-22: Approve: Good implementation of optional email notifications.

The addition of conditional email notifications based on the SLURM_EMAIL environment variable is a good improvement. It enhances flexibility and follows best practices.

Consider adding a comment explaining how to set the SLURM_EMAIL variable before running the script, for better user guidance. For example:

# Add this comment at the beginning of the script
# To enable email notifications, set the SLURM_EMAIL environment variable before running:
# export SLURM_EMAIL="[email protected]"

29-29: Approve: Good practice to set OMP_NUM_THREADS=1.

Setting OMP_NUM_THREADS=1 is a good practice to prevent the creation of too many threads, which can lead to performance issues. This aligns well with the PR objective of addressing Slurm multithreading performance.

Consider expanding the comment on line 29 to provide more context. For example:

# Set OMP_NUM_THREADS=1 to prevent conflicts with Stable-Baselines3's own parallelization.
# This can help avoid oversubscription and potential slowdowns in multi-threaded environments.

This more detailed comment would help future maintainers understand the reasoning behind this setting.

Also applies to: 45-45

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 13049cd and 835ac18.

📒 Files selected for processing (1)
  • SLURM/slurm_train.sl (2 hunks)
🔇 Additional comments not posted (3)
SLURM/slurm_train.sl (3)

4-5: Verify the partition change and job time reduction.

The partition has been changed from 'epyc-gpu' to 'epyc-gpu-test'. Please confirm if this change is intentional and appropriate for the current stage of development.

Regarding the job time reduction from 10 hours to 2 hours, please refer to the existing comment about this significant change. Ensure that 2 hours is sufficient for the log_gpu_cpu_usage.py script to gather all necessary data.


48-48: Clarify the change in the executed script.

The primary script executed by srun has been changed from slurm_PPO_robot_sf.py to log_gpu_cpu_usage.py. This appears to be related to the PR objective of addressing Slurm multithreading performance issues.

Please clarify:

  1. Is this a temporary change for debugging purposes?
  2. If temporary, how will the original training script be integrated with this logging functionality?
  3. If permanent, how does this affect the overall workflow of the project?

Line range hint 1-51: Summary: Good progress on addressing performance issues, some clarifications needed.

Overall, the changes to this SLURM script align well with the PR objectives of addressing Slurm multithreading performance issues. The modifications to the partition, job time, email notifications, and OMP_NUM_THREADS setting are appropriate.

However, there are a few points that require clarification or additional documentation:

  1. The change in the executed script from training to logging.
  2. The intention behind the partition change.
  3. The sufficiency of the reduced job time.

Once these points are addressed and the suggested documentation improvements are made, this script will be in good shape to support the performance optimization efforts.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 835ac18 and ac982ab.

📒 Files selected for processing (1)
  • SLURM/partition_queue_start_time.sh (1 hunks)
🔇 Additional comments not posted (1)
SLURM/partition_queue_start_time.sh (1)

13-17: Good practice: Checking for required command.

The check for the existence of the squeue command is a good practice. It ensures that the script fails early with a clear error message if the required command is not available.

Comment on lines 1 to 11
#! /bin/bash

# Script to query jobs in a specified SLURM partition

# Default partition name
PARTITION="epyc-gpu-test"

# Allow overriding the partition name via command line argument
if [ $# -eq 1 ]; then
    PARTITION="$1"
fi

🛠️ Refactor suggestion

Consider enhancing script flexibility with command-line options.

The script effectively sets a default partition and allows overriding it. However, to improve flexibility and usability, consider implementing more robust command-line option handling.

You could use getopts to handle multiple options, such as:

  • -p or --partition to specify the partition
  • -h or --help to display usage information

Here's an example implementation:

#!/bin/bash

# Default partition name
PARTITION="epyc-gpu-test"

# Function to display usage information
usage() {
    echo "Usage: $0 [-p PARTITION]"
    echo "  -p PARTITION  Specify the SLURM partition (default: epyc-gpu-test)"
    echo "  -h            Display this help message"
}

# Parse command-line options
while getopts ":p:h" opt; do
    case ${opt} in
        p )
            PARTITION=$OPTARG
            ;;
        h )
            usage
            exit 0
            ;;
        \? )
            echo "Invalid option: $OPTARG" 1>&2
            usage
            exit 1
            ;;
        : )
            echo "Invalid option: $OPTARG requires an argument" 1>&2
            usage
            exit 1
            ;;
    esac
done

This approach provides better user guidance and allows for future expansion of command-line options.

Resolved review threads:
  • SLURM/partition_queue_start_time.sh (3, two of them outdated)
@ll7 ll7 marked this pull request as draft September 25, 2024 11:51
@ll7 (Owner, Author) commented Sep 25, 2024

@coderabbitai configuration

coderabbitai bot (Contributor) commented Sep 25, 2024

CodeRabbit configuration
language: en-US
tone_instructions: ''
early_access: false
enable_free_tier: true
reviews:
  profile: chill
  request_changes_workflow: false
  high_level_summary: true
  high_level_summary_placeholder: '@coderabbitai summary'
  auto_title_placeholder: '@coderabbitai'
  review_status: true
  poem: true
  collapse_walkthrough: false
  sequence_diagrams: true
  changed_files_summary: true
  labeling_instructions: []
  path_filters: []
  path_instructions: []
  abort_on_close: true
  auto_review:
    enabled: true
    auto_incremental_review: true
    ignore_title_keywords: []
    labels: []
    drafts: false
    base_branches: []
  tools:
    shellcheck:
      enabled: true
    ruff:
      enabled: true
    markdownlint:
      enabled: true
    github-checks:
      enabled: true
      timeout_ms: 90000
    languagetool:
      enabled: true
      enabled_only: false
      level: default
    biome:
      enabled: true
    hadolint:
      enabled: true
    swiftlint:
      enabled: true
    phpstan:
      enabled: true
      level: default
    golangci-lint:
      enabled: true
    yamllint:
      enabled: true
    gitleaks:
      enabled: true
    checkov:
      enabled: true
    detekt:
      enabled: true
    eslint:
      enabled: true
    rubocop:
      enabled: true
    buf:
      enabled: true
    regal:
      enabled: true
    actionlint:
      enabled: true
    pmd:
      enabled: true
    cppcheck:
      enabled: true
    semgrep:
      enabled: true
    circleci:
      enabled: true
chat:
  auto_reply: true
knowledge_base:
  opt_out: false
  learnings:
    scope: auto
  issues:
    scope: auto
  jira:
    project_keys: []
  linear:
    team_keys: []
  pull_requests:
    scope: auto

ll7 added 2 commits September 26, 2024 10:19
This commit updates the `log_gpu_cpu_usage.py` script in the SLURM directory to include logging of GPU memory utilization in addition to CPU and GPU usage. The `LogResourceUsageCallback` class now records the memory utilization of each GPU in the `gpu_{idx}_memory_util` field. This enhancement provides more comprehensive monitoring of resource usage during training.
@ll7 ll7 marked this pull request as ready for review September 26, 2024 11:39
@ll7 ll7 merged commit b62e6b7 into main Sep 26, 2024
2 checks passed
@ll7 ll7 deleted the ll7/issue57-Slurm-multithreading-setting-might-be-slow branch September 26, 2024 11:43
@coderabbitai coderabbitai bot mentioned this pull request Sep 27, 2024