Issues: microsoft/DeepSpeed
[BUG] [Fix Suggestion] Uneven head sequence parallelism (labels: bug, training) #6774, opened Nov 21, 2024 by Eugene29
[BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs (labels: bug, training) #6772, opened Nov 20, 2024 by traincheck-team
[BUG] [Fix-Suggested] Checkpoint Inconsistency When Freezing Model Parameters Before deepspeed.initialize #6771, opened Nov 20, 2024 by traincheck-team
[BUG] [Fix-Suggested] KeyError in stage_1_and_2.py Due to Optimizer-Model Parameter Mismatch #6770, opened Nov 20, 2024 by traincheck-team
[BUG] clip_grad_norm for zero_optimization mode is not working (labels: bug, training) #6767, opened Nov 20, 2024 by chengmengli06
[BUG] NCCL operation timeout when training with deepspeed_zero3_offload or deepspeed_zero3 on RTX 4090 (labels: bug, training) #6756, opened Nov 18, 2024 by MLS2021
Some demos of how to configure offloading tensors to an NVMe device (a config sketch follows below) #6752, opened Nov 15, 2024 by niebowen666
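For reference on the NVMe offload question above, a minimal sketch of a ZeRO-3 configuration that offloads parameters and optimizer state to NVMe, written as a Python dict for deepspeed.initialize. The "/local_nvme" path, the batch size, and the aio tuning values are placeholder assumptions to adjust per system; see the ZeRO-Infinity documentation for the authoritative options.

    import deepspeed

    # Sketch: ZeRO stage 3 with parameter and optimizer offload to an NVMe device.
    # "/local_nvme" is a placeholder mount point; aio values are illustrative.
    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {
            "stage": 3,
            "offload_param": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
            "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        },
        "aio": {"block_size": 1048576, "queue_depth": 8, "overlap_events": True},
    }

    # model is assumed to be a torch.nn.Module defined elsewhere:
    # engine, optimizer, _, _ = deepspeed.initialize(
    #     model=model, model_parameters=model.parameters(), config=ds_config)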
Model Checkpoint docs are incorrectly rendered on deepspeed.readthedocs.io (labels: bug, documentation) #6747, opened Nov 12, 2024 by akeshet
Is DeepSpeed-Domino compatible with other parallel strategies? #6744, opened Nov 12, 2024 by Andy666G
[BUG] max_grad_norm has no effect (labels: bug, compression) #6743, opened Nov 12, 2024 by yiyepiaoling0715
GPU memory is not released after deleting tensors in optimizer.bit16groups #6729, opened Nov 8, 2024 by wheresmyhair
[BUG] Any clue about an MFU drop? (labels: bug, training) #6727, opened Nov 8, 2024 by SeunghyunSEO
[BUG] [ROCm] Fine-tuning DeepSeek-Coder-V2-Lite-Instruct with 8 MI300X GPUs results in c10::DistBackendError (labels: bug, rocm, training) #6725, opened Nov 8, 2024 by nikhil-tensorwave
[BUG] Zero3 for torch.compile with compiled_autograd when running LayerNorm (labels: bug, training) #6719, opened Nov 6, 2024 by yitingw1
[BUG] DeepSpeed accuracy issue with torch.compile if the activation checkpoint function is not compiler-disabled (labels: bug, training) #6718, opened Nov 6, 2024 by jerrychenhf
[BUG] Issue with ZeRO optimization for Llama-2-7b fine-tuning on Intel GPUs (labels: bug, training) #6713, opened Nov 5, 2024 by molang66
"__nv_bfloat162" has already been defined
install
Installation and package dependencies
windows
#6709
opened Nov 4, 2024 by
wolfljj
[REQUEST] Some questions about DeepSpeed sequence parallelism (labels: enhancement) #6708, opened Nov 4, 2024 by yingtongxiong
[REQUEST] Non-element-wise Optimizer Compatibility (labels: enhancement) #6701, opened Nov 2, 2024 by Triang-jyed-driung
How can I convert ZeRO-0 DeepSpeed weights into an fp32 model checkpoint? (see the sketch below) (labels: enhancement) #6699, opened Nov 1, 2024 by liming-ai
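For context on the fp32-conversion question above: DeepSpeed ships zero_to_fp32 helpers (and a standalone zero_to_fp32.py script inside each saved checkpoint directory) that consolidate ZeRO-partitioned checkpoints into a single fp32 state dict. A minimal sketch follows, assuming a checkpoint written by engine.save_checkpoint(); "checkpoints/" and the output filename are placeholders, and whether these helpers also cover ZeRO stage 0 is exactly what the issue asks, so treat ZeRO-0 support as unverified here.

    import torch
    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

    # "checkpoints/" is a placeholder for the directory written by engine.save_checkpoint().
    # The helper gathers the ZeRO shards and returns a consolidated fp32 state dict.
    state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoints/")
    torch.save(state_dict, "pytorch_model_fp32.bin")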
[BUG] Universal Checkpoint Conversion: Resumed Training Behaves as If Model Initialized from Scratch (labels: bug, training) #6691, opened Oct 30, 2024 by purefall