-
Notifications
You must be signed in to change notification settings - Fork 37
Add ADR for what repository will contain RAG #163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
- There will be a new repository for RAG. | ||
- It will be located at https://github.com/instructlab/retrieval-augmented-generation | ||
- By mid-January, it will be available and working but not integrated with InstructLab. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there some rationale behind these two dates? just to hold ourselves accountable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dates are driven mainly by availability of people to do the work. The mid-March elements of the plan will require a lot of coordination with the maintainers of the core instructlab/instructlab repo. Those people are mostly already committed to other activities through mid-January.
- By mid-March, it will be integrated with InstructLab with the new repository being invoked by the core repository and maybe also by the SDG repository. | ||
- Eventually, it will be integrated with InstructLab with the new repository being invoked only by the core repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I agree with necessarily assigning dates to these efforts from now. I think it does make sense to split up these efforts into phases, and then we can internally scope out work for the desired releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the dates are important, but I am fine with dropping them from this document and using other venues to get aligned around these dates. I will do that.
- Con: Many things we will want to do to add advanced functionality to make RAG more effective will require changes to both indexing and run-time RAG. If those components are split across multiple repositories, that will make delivering such changes more complicated. | ||
- Put the indexing and run-time RAG code in <https://github.com/instructlab/instructlab> (core) | ||
- Pro: This has the advantage of not adding any new dependencies. | ||
- Pro: However, since the existing document processing is in SDG, the flow for indexing for RAG would be a bit complicated (i.e., it starts with a CLI call handled by the core repo then goes to SDG for some of the document processing and then back to the core for vector database indexing). That drawback will be eliminated if/when the document processing moves into the core repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure it's an issue at all that it's making a call to SDG since the instructlab/instructlab
dependencies contain SDG anyway. We don't expect users to be using ilab
without SDG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
however, I do agree with the below cons and don't think it's best to store any of this in instructlab/instructlab
if we want to advertise this as 'our RAG solution'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure it's an issue at all that it's making a call to SDG since the instructlab/instructlab dependencies contain SDG anyway. We don't expect users to be using ilab without SDG
The issue is not the dependencies, it is the flow -- if document processing starts in core and then moves to SDG and then moves back to core, that flow will be more difficult to understand and maintain than if it was all in one place. This should be labeled as "con" , not "pro". I will fix that and add some more text to clarify what my concern is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Core
does serve as a sort of orchestrator so I'm not worried about that, but making Core required for SDG to interact with RAG may not be preferred - a definite con
- Con: Many things we will want to do to add advanced functionality to make RAG more effective will require changes to both indexing and run-time RAG. If those components are split across multiple repositories, that will make delivering such changes more complicated. | ||
- Put the indexing and run-time RAG code in <https://github.com/instructlab/instructlab> (core) | ||
- Pro: This has the advantage of not adding any new dependencies. | ||
- Pro: However, since the existing document processing is in SDG, the flow for indexing for RAG would be a bit complicated (i.e., it starts with a CLI call handled by the core repo then goes to SDG for some of the document processing and then back to the core for vector database indexing). That drawback will be eliminated if/when the document processing moves into the core repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Core
does serve as a sort of orchestrator so I'm not worried about that, but making Core required for SDG to interact with RAG may not be preferred - a definite con
I think I am settling on advocating on a prototyping phase working within a submodule in the It can be very difficult to draw component boundaries from the outset - things like the ones I listed above will need to be figured out across both retrieval and inference; What I want to avoid is having two places where there is config management, two places where prompt templates are defined, two places where there is model management (since we'll have embedding models too), and so on. A directory level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me - @anastasds @jwm4 appreciate all the work y'all put into this! - I'd like @cdoern to approve as well but going ahead and giving this my sign-off
No description provided.