Create proposal for training repo #85

Open · wants to merge 1 commit into main

2 changes: 2 additions & 0 deletions .gitignore
@@ -2,3 +2,5 @@

# Spelling
dictionary.dic

venv

> **Contributor:** This was merged in #90. Please rebase.

10 changes: 10 additions & 0 deletions .spellcheck-en-custom.txt
@@ -37,14 +37,17 @@
Eval
Excalidraw
exfiltrate
exfiltrating
Eyceoz
Finetuning
formedness
Gaudi
GFX
GGUF
GGUFs
GiB
Gmail
gpu
Habana
hipBLAS
ilab
impactful
@@ -58,6 +61,7 @@
Kaggle
Kaggle's
Kai
Kubernetes
Kunstle
lignment
LLM
llms
@@ -72,8 +76,12 @@
Miniforge
Mixtral
MLX
mlx
MPS
Mustafa
NCCL
NVidia
Nvidia
Oleg
ollama
Ollama
orchestrator
@@ -98,6 +106,7 @@
RDNA
README
rebase
repo
repos
ROCm
RTX
RX
@@ -110,6 +119,7 @@
sexualized
SHA
Shivchander
Signoff
Silkin
Srivastava
subdirectory
Sudalairaj
77 changes: 77 additions & 0 deletions docs/training/training-repo.md
@@ -0,0 +1,77 @@
# Modify Repository Proposal: Training

> **Contributor:** How about simply `docs/training/repo.md`? We are in the training directory already so no need to repeat it.


## Summary

This document proposes taking the existing [instructlab/training](https://github.com/instructlab/training)
repository and transforming it into a library consumed by other clients,
such as the CLI and the current `main_ds.py` script that launches a full fine-tune.

## Background

Today, training is implemented for different use cases
across different repos. LoRA/QLoRA training currently lives
in the [`instructlab`](https://github.com/instructlab/instructlab)
CLI repo and contains logic for both Linux and macOS training.
Each of these then brings additional logic for platform-specific
accelerators: for example, NVIDIA CUDA and Intel Habana on Linux,
and MLX and MPS on macOS.

> **Contributor:** IIUC, this logic will have to be extracted to instructlab/training once we agree on this proposal, right?
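
To make the duplication concrete, below is a minimal, hypothetical sketch of the platform-selection branching each training implementation ends up carrying on its own. The helper name is illustrative, not code from either repo, and real implementations add further branches (and dependencies) for Gaudi or MLX:

```python
import torch


def select_device() -> torch.device:
    """Pick the best available accelerator for training.

    Hypothetical helper: variants of this branching logic end up
    duplicated in every client that owns its own training loop.
    """
    if torch.cuda.is_available():  # NVIDIA CUDA on Linux
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple MPS on macOS
        return torch.device("mps")
    return torch.device("cpu")  # fall back when no accelerator is present
```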

On the other hand, the logic for full fine-tuning lives today in the
[instructlab/training](https://github.com/instructlab/training) repo. This implementation is currently CUDA-specific,
as it relies on NCCL to handle communication across NVIDIA GPUs.
We want to bring support for alternative accelerators, such as Intel Gaudi, to this repo as well, without rewriting much of the logic.

We need a library that provides a common interface
for training, so that all of the hard logic is concentrated
in a single place.

The [instructlab/training](https://github.com/instructlab/training) repo should then
become the home for the overall training library.

This library would then expose a "simple" interface that all
other clients can pull in and use, with little changing on
their side beyond how they invoke training.
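
As a rough sketch only, such an interface could be as small as one entry point plus a couple of configuration objects. The names below (`run_training`, `TorchrunArgs`, `TrainingArgs`) are placeholders for whatever the library ultimately exposes, not an agreed-upon API:

```python
from dataclasses import dataclass


@dataclass
class TorchrunArgs:
    """Distributed-launch settings (illustrative placeholder)."""
    nnodes: int = 1
    nproc_per_node: int = 1


@dataclass
class TrainingArgs:
    """Core training settings shared by all clients (illustrative placeholder)."""
    model_path: str
    data_path: str
    ckpt_output_dir: str
    num_epochs: int = 1


def run_training(torch_args: TorchrunArgs, train_args: TrainingArgs) -> None:
    """Single entry point that clients such as the CLI would call.

    Internally, the library would dispatch to the appropriate backend
    (CUDA/NCCL, Gaudi, MPS, ...) so clients never carry accelerator logic.
    """
    raise NotImplementedError("interface sketch only")
```

A client such as the CLI would then only build the two argument objects and call `run_training(...)`, leaving all accelerator- and distribution-specific details inside the library.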

## Maintainers

The maintainers should be the folks who currently work on training.
Namely:

- [Aldo Pareja](https://github.com/aldopareja)
- [James Kunstle](https://github.com/orgs/instructlab/people/JamesKunstle)
- [Oleg Silkin](https://github.com/orgs/instructlab/people/RobotSail)
- [Mustafa Eyceoz](https://github.com/orgs/instructlab/people/Maxusmusti)

We could also grant blanket access by assigning the `Backend Maintainers` GitHub team.

> **Contributor:** I'd like to be added to this as the community train rep currently. Or @alimaredia. If we are planning on moving that code to the library, I would appreciate a part in that 🙏

> **Member:** The sdg and eval repos bootstrapped access by copying the existing Backend Maintainers team. That's an alternative that could be used here.

> **Contributor** (on lines +41 to +43): These 3 links return 404 for me.

## Alternatives Considered

### Keep everything separate

Rather than consolidating the logic, we can keep the existing
logic as-is.

This would mean that the [CLI repo](https://github.com/instructlab/instructlab) and the [training repo](https://github.com/instructlab/training) would both maintain their own implementations of training.

> **Contributor:** This is more what I was thinking, and this overlaps with #52 a lot. Something we need to reconcile.

> **Member:** Does this comment mean that this alternative is what you had in mind?


This means that if extra logic must be added for Intel Gaudi or NVIDIA, it would need to be added and tested in two different places.

Since we need to move the full training to live inside the
[CLI repo](https://github.com/instructlab/instructlab),
we would end up with two duplicate implementations of
full fine-tune training, adding further points of complication.

### Move everything into one repo but don't design an interface

We can move the existing training logic into the training repo
and simply have the existing clients consume it this way.

The challenge here is that much of the logic is very specific
to the client application, so development would constantly cross repo boundaries.

If someone wants to open a PR against the CLI repo for a change
to LoRA training, they'd also need to open a second PR against
the training repo and maintain the two PRs at once.
When a maintainer requests a change in one PR, both would need to be updated to stay consistent.

The challenges this presents to both developers and maintainers
are clear. A natural conclusion, therefore, is that we need a separate library that provides an extensible interface for other clients to consume.