Implement multi-hpu training with FSDP for Gaudi 3 cards #294

Open
Tracked by #2623
ktam3 opened this issue Oct 22, 2024 · 0 comments · May be fixed by #330
ktam3 commented Oct 22, 2024

Tasks needed:

  • get a toy FSDP loop working on Gaudi (incomplete so far)
  • test our code on the cards (this likely won't work immediately)
  • check whether we can get a toy Accelerate+FSDP loop running on Gaudi cards (I'm worried this won't work)
    • if YES, we adapt our code to accommodate Gaudi, AMD, and NVIDIA, then build the configs
    • if NO, we have two choices:
      • implement a Gaudi-only FSDP training loop (the easiest/cheapest option, and probably the one we'll go with)
      • patch Accelerate to support Gaudi and work on upstreaming the change (something I think we should do even if we go with the first option)
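A minimal sketch of what the toy loop in the first task might look like. This is illustrative, not our actual training code: the model, tensor shapes, and CPU fallback are made up for the example, and it assumes the Habana PyTorch bridge (`habana_frameworks.torch`) is installed to expose the `hpu` device. FSDP wrapping is only attempted when a distributed process group is initialized, since FSDP requires one:

```python
import torch
import torch.nn as nn
import torch.distributed as dist


def pick_device() -> torch.device:
    # Gaudi cards show up as the "hpu" device once the Habana bridge
    # (habana_frameworks.torch) is imported; fall back to CPU otherwise.
    try:
        import habana_frameworks.torch.core  # noqa: F401  (registers "hpu")
        return torch.device("hpu")
    except ImportError:
        return torch.device("cpu")


def toy_step(device: torch.device) -> float:
    # Tiny throwaway model just to exercise forward/backward on the device.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)

    # Only wrap with FSDP when a process group exists; FSDP requires
    # torch.distributed to be initialized, and whether this wrapping works
    # on the Gaudi stack is exactly what the task above needs to verify.
    if dist.is_available() and dist.is_initialized():
        from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
        model = FSDP(model)

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(8, 16, device=device)
    y = torch.randint(0, 4, (8,), device=device)

    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()


if __name__ == "__main__":
    device = pick_device()
    print(f"device={device}, loss={toy_step(device):.4f}")
```

On a machine without Gaudi cards this falls back to a plain CPU step, which at least lets us sanity-check the loop shape before debugging the HPU-specific parts.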

Risks:

The machine currently used for testing has issues that are delaying overall progress, so James cannot yet confirm whether we can target this for 1.3. He is discussing with the Intel team and @tiran to troubleshoot further. This is a major highlighted risk, since we want to target this Tech Preview for 1.3.
