Skip to content

Add ADR for what repository will contain RAG #163

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 0 commits into from

Conversation

jwm4
Copy link
Contributor

@jwm4 jwm4 commented Dec 6, 2024

No description provided.


- There will be a new repository for RAG.
- It will be located at https://github.com/instructlab/retrieval-augmented-generation
- By mid-January, it will be available and working but not integrated with InstructLab.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there some rationale behind these two dates? just to hold ourselves accountable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dates are driven mainly by availability of people to do the work. The mid-March elements of the plan will require a lot of coordination with the maintainers of the core instructlab/instructlab repo. Those people are mostly already committed to other activities through mid-January.

Comment on lines 35 to 36
- By mid-March, it will be integrated with InstructLab with the new repository being invoked by the core repository and maybe also by the SDG repository.
- Eventually, it will be integrated with InstructLab with the new repository being invoked only by the core repository.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree with necessarily assigning dates to these efforts from now. I think it does make sense to split up these efforts into phases, and then we can internally scope out work for the desired releases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the dates are important, but I am fine with dropping them from this document and using other venues to get aligned around these dates. I will do that.

- Con: Many things we will want to do to add advanced functionality to make RAG more effective will require changes to both indexing and run-time RAG. If those components are split across multiple repositories, that will make delivering such changes more complicated.
- Put the indexing and run-time RAG code in <https://github.com/instructlab/instructlab> (core)
- Pro: This has the advantage of not adding any new dependencies.
- Pro: However, since the existing document processing is in SDG, the flow for indexing for RAG would be a bit complicated (i.e., it starts with a CLI call handled by the core repo then goes to SDG for some of the document processing and then back to the core for vector database indexing). That drawback will be eliminated if/when the document processing moves into the core repository.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure it's an issue at all that it's making a call to SDG since the instructlab/instructlab dependencies contain SDG anyway. We don't expect users to be using ilab without SDG

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however, I do agree with the below cons and don't think it's best to store any of this in instructlab/instructlab if we want to advertise this as 'our RAG solution'

Copy link
Contributor Author

@jwm4 jwm4 Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure it's an issue at all that it's making a call to SDG since the instructlab/instructlab dependencies contain SDG anyway. We don't expect users to be using ilab without SDG

The issue is not the dependencies, it is the flow -- if document processing starts in core and then moves to SDG and then moves back to core, that flow will be more difficult to understand and maintain than if it was all in one place. This should be labeled as "con" , not "pro". I will fix that and add some more text to clarify what my concern is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Core does serve as a sort of orchestrator so I'm not worried about that, but making Core required for SDG to interact with RAG may not be preferred - a definite con

- Con: Many things we will want to do to add advanced functionality to make RAG more effective will require changes to both indexing and run-time RAG. If those components are split across multiple repositories, that will make delivering such changes more complicated.
- Put the indexing and run-time RAG code in <https://github.com/instructlab/instructlab> (core)
- Pro: This has the advantage of not adding any new dependencies.
- Pro: However, since the existing document processing is in SDG, the flow for indexing for RAG would be a bit complicated (i.e., it starts with a CLI call handled by the core repo then goes to SDG for some of the document processing and then back to the core for vector database indexing). That drawback will be eliminated if/when the document processing moves into the core repository.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Core does serve as a sort of orchestrator so I'm not worried about that, but making Core required for SDG to interact with RAG may not be preferred - a definite con

@anastasds
Copy link
Contributor

anastasds commented Dec 11, 2024

I think I am settling on advocating on a prototyping phase working within a submodule in the instructlab repo to figure out things that are shared concerns across the product like config, prompt template management, model interactions, model management, and more, and explicitly intend to extract something independently deployable into its own repo later.

It can be very difficult to draw component boundaries from the outset - things like the ones I listed above will need to be figured out across both retrieval and inference;
if we start with a submodule in the core repo, I think it's highly likely that we can extract it into its own repo in a cleaner way later.

What I want to avoid is having two places where there is config management, two places where prompt templates are defined, two places where there is model management (since we'll have embedding models too), and so on.

A directory level CODEOWNERS in the instructlab repo should give us the autonomy needed during the design exploration phase of things.

@jwm4 jwm4 changed the title Add ADR for New repository for RAG Add ADR for what repository will contain RAG Dec 13, 2024
Copy link
Member

@nathan-weinberg nathan-weinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me - @anastasds @jwm4 appreciate all the work y'all put into this! - I'd like @cdoern to approve as well but going ahead and giving this my sign-off

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants