From efed0cc32efa29b8258eb699c51d9fe32af7250b Mon Sep 17 00:00:00 2001 From: InAnYan Date: Mon, 29 Jul 2024 14:36:03 +0300 Subject: [PATCH] Fix from code review + ADR --- CHANGELOG.md | 2 +- docs/decisions/0033-store-chats-in-mvstore.md | 1 + .../0037-rag-architecture-implementation.md | 107 ++++++++++++++++++ src/main/java/module-info.java | 1 + src/main/java/org/jabref/gui/Dark.css | 1 - .../chatmessage/ChatMessageComponent.java | 2 +- .../errorstate/ErrorStateComponent.fxml | 2 +- .../errorstate/ErrorStateComponent.java | 5 +- .../org/jabref/gui/preferences/ai/AiTab.java | 4 +- .../logic/ai/models/EmbeddingModel.java | 6 +- src/main/resources/tinylog.properties | 3 +- 11 files changed, 122 insertions(+), 12 deletions(-) create mode 100644 docs/decisions/0037-rag-architecture-implementation.md diff --git a/CHANGELOG.md b/CHANGELOG.md index 7638f7fee41..14eede556a1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,7 +11,7 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv ### Added -- We added an AI chat for linked files. [#11430](https://github.com/JabRef/jabref/pull/11430) +- We added an AI-based chat for entries with linked PDF files. [#11430](https://github.com/JabRef/jabref/pull/11430) - We added support for selecting and using CSL Styles in JabRef's OpenOffice/LibreOffice integration for inserting bibliographic and in-text citations into a document. [#2146](https://github.com/JabRef/jabref/issues/2146), [#8893](https://github.com/JabRef/jabref/issues/8893) - We added Tools > New library based on references in PDF file... to create a new library based on the references section in a PDF file. [#11522](https://github.com/JabRef/jabref/pull/11522) - When converting the references section of a paper (PDF file), more than the last page is treated. 
[#11522](https://github.com/JabRef/jabref/pull/11522) diff --git a/docs/decisions/0033-store-chats-in-mvstore.md b/docs/decisions/0033-store-chats-in-mvstore.md index 5c29396ab3d..7bd55002953 100644 --- a/docs/decisions/0033-store-chats-in-mvstore.md +++ b/docs/decisions/0033-store-chats-in-mvstore.md @@ -42,6 +42,7 @@ Chosen option: "MVStore", because it is simple and memory-efficient. * Good, because automatic loading and saving to disk * Good, because memory-efficient +* Bad, because it does not support mutable values in maps. * Bad, because the order of messages need to be "hand-crafted" (e.g., by mapping from an Integer to the concrete message) * Bad, because it stores data as key-values, but not as a custom data type (like tables in RDBMS) diff --git a/docs/decisions/0037-rag-architecture-implementation.md b/docs/decisions/0037-rag-architecture-implementation.md new file mode 100644 index 00000000000..d38911ba83e --- /dev/null +++ b/docs/decisions/0037-rag-architecture-implementation.md @@ -0,0 +1,107 @@ +--- +nav_order: 0037 +parent: Decision Records +--- + +# RAG architecture implementation + +## Context and Problem Statement + +The current trend in question answering (Q&A) using large language models (LLMs) or other +AI-related technology is retrieval-augmented generation (RAG). + +RAG is related to [Open Generative QA](https://huggingface.co/tasks/question-answering): +an LLM (which generates text) is supplied with context (chunks of information extracted +from various sources) and then generates an answer. + +The RAG architecture consists of [these steps](https://www.linkedin.com/pulse/rag-architecture-deep-dive-frank-denneman-4lple) (simplified): + +How source data is processed: +1. **Indexing**: the application is supplied with information sources (PDFs, text files, web pages, etc.) +2. **Conversion**: files are converted to strings (because an LLM works on text data). +3. 
**Splitting**: the string from the previous step is split into parts (because an LLM has a fixed context window, meaning +it cannot handle big documents). +4. **Embedding generation**: a vector of float values is generated for each chunk. This vector represents the meaning +of the text, and the main property of such vectors is that chunks with similar meaning have vectors that are close to each other. +Such a vector is generated by a separate model called an *embedding model*. +5. **Store**: chunks with relevant metadata (for example, from which document they were generated) and their embedding vectors are stored in a vector database. + +How an answer is generated: +1. **Ask**: the user asks the AI a question. +2. **Question embedding**: an embedding model generates an embedding vector for the query. +3. **Data finding**: the vector database searches for the most relevant pieces of information (a finite number of pieces). +This is done by vector similarity, meaning how close each chunk vector is to the question vector. +4. **Prompt generation**: using a prompt template, the user question is *augmented* with the found information. The found information +is generally not shown to the user, as it may seem strange that a user asked a question that was already supplied with +found information. These pieces of text can either be ignored entirely or shown separately in a UI tab "Sources". +5. **LLM generation**: the LLM generates output. + +This ADR concerns the implementation of this architecture. + +## Decision Drivers + +* Prefer good and maintained libraries over self-made solutions for better quality. +* The framework should be easy to use. It would seem strange if a user wants to download a BIB editor, but they are +required to install some separate software (or even a Python runtime). +* RAG shouldn't incur any additional monetary costs. Users should pay only for LLM generation. 
+ +## Considered Options + +* Use a hand-crafted RAG +* Use a third-party Java library +* Use a standalone application +* Use an online service + +## Decision Outcome + +Chosen option: mix of "Use a hand-crafted RAG" and "Use a third-party Java library". + +Third-party libraries provide excellent resources for connecting to an LLM or extracting text from PDF files. For RAG, +we mostly used all the machinery provided by `langchain4j`, but there were moments that should be hand-crafted: +- **LLM connection**: due to https://github.com/langchain4j/langchain4j/issues/1454 (https://github.com/InAnYan/jabref/issues/77) + this was delegated to another library `jvm-openai`. +- **Embedding generation**: due to https://github.com/langchain4j/langchain4j/issues/1492 (https://github.com/InAnYan/jabref/issues/79), + this was delegated to another library `djl`. +- **Indexing**: `langchain4j` is just a bunch of useful tools, but we still have to orchestrate when indexing should +happen and what files should be processed. +- **Vector database**: there seems to be no embedded vector database (except SQLite with `sqlite-vss` extension). We +implemented vector database using `MVStore` because that was easy. 
+ +## Pros and Cons of the Options + +### Use a hand-crafted RAG + +* Good, because we have the full control over generation +* Good, because extendable +* Bad, because LLM connection, embedding models, vector storage, and file conversion should be implemented manually +* Bad, because it's hard to make a complex RAG architecture + +### Use a third-party Java library + +* Good, because provides well-tested and maintained tools +* Good, because libraries have many LLM integrations, as well as embedding models, vector storage, and file conversion tools +* Good, because they provide complex RAG pipelines and extensions +* Neutral, because they provide many tools and functions, but they should be orchestrated in a real application +* Bad, because some of them are raw and undocumented +* Bad, because they are all similar to `langchain` +* Bad, because they may have bugs + +### Use a standalone application + +* Good, because they provide complex RAG pipelines and extensions +* Good, because no additional code is required (except connecting to API) +* Neutral, because they provide not that many LLM integrations, embedding models, and vector storages +* Bad, because a standalone app running is required. Users may be required to set it up properly +* Bad, because the internal working of app is hidden. Additional agreement to Privacy or Terms of Service is needed +* Bad, because hard to extend + +### Use an online service + +* Good, because all data is processed and stored not on the user's machine: faster and no memory is used. 
+* Good, because they provide complex RAG pipelines and extensions +* Good, because no additional code is required (except connecting to API) +* Neutral, because they provide not that many LLM integrations, embedding models, and vector storages +* Bad, because requires connection to Internet +* Bad, because data is processed by a third party company +* Bad, because most of them require additional payment (in fact, it would be impossible to develop a free service like +that) diff --git a/src/main/java/module-info.java b/src/main/java/module-info.java index 0a39c305ce7..e8a1165c003 100644 --- a/src/main/java/module-info.java +++ b/src/main/java/module-info.java @@ -154,4 +154,5 @@ // Provides number input fields for parameters in AI expert settings requires com.dlsc.unitfx; requires de.saxsys.mvvmfx.validation; + requires dd.plist; } diff --git a/src/main/java/org/jabref/gui/Dark.css b/src/main/java/org/jabref/gui/Dark.css index 56ddd4cc167..0e426159450 100644 --- a/src/main/java/org/jabref/gui/Dark.css +++ b/src/main/java/org/jabref/gui/Dark.css @@ -163,4 +163,3 @@ .file-row-text { -fx-text-fill: -fx-light-text-color; } - diff --git a/src/main/java/org/jabref/gui/ai/components/chatmessage/ChatMessageComponent.java b/src/main/java/org/jabref/gui/ai/components/chatmessage/ChatMessageComponent.java index 020696a18e3..3b66013dbab 100644 --- a/src/main/java/org/jabref/gui/ai/components/chatmessage/ChatMessageComponent.java +++ b/src/main/java/org/jabref/gui/ai/components/chatmessage/ChatMessageComponent.java @@ -46,7 +46,7 @@ private void initialize() { sourceLabel.setText(Localization.lang("AI")); contentTextArea.setText(aiMessage.text()); } else { - LOGGER.warn("ChatMessageComponent supports only user or AI messages, but other type was passed: " + chatMessage.type().name()); + LOGGER.warn("ChatMessageComponent supports only user or AI messages, but other type was passed: {}", chatMessage.type().name()); } } diff --git 
a/src/main/java/org/jabref/gui/ai/components/errorstate/ErrorStateComponent.fxml b/src/main/java/org/jabref/gui/ai/components/errorstate/ErrorStateComponent.fxml index c21d803c876..fbf69d45bce 100644 --- a/src/main/java/org/jabref/gui/ai/components/errorstate/ErrorStateComponent.fxml +++ b/src/main/java/org/jabref/gui/ai/components/errorstate/ErrorStateComponent.fxml @@ -6,7 +6,7 @@
- + diff --git a/src/main/java/org/jabref/gui/ai/components/errorstate/ErrorStateComponent.java b/src/main/java/org/jabref/gui/ai/components/errorstate/ErrorStateComponent.java index 358cc55af3b..c993809f935 100644 --- a/src/main/java/org/jabref/gui/ai/components/errorstate/ErrorStateComponent.java +++ b/src/main/java/org/jabref/gui/ai/components/errorstate/ErrorStateComponent.java @@ -12,6 +12,7 @@ public class ErrorStateComponent extends BorderPane { @FXML private Text titleText; @FXML private Text contentText; + @FXML private VBox contentsVBox; public ErrorStateComponent(String title, String content) { ViewLoader.view(this) @@ -25,7 +26,7 @@ public ErrorStateComponent(String title, String content) { public static ErrorStateComponent withSpinner(String title, String content) { ErrorStateComponent errorStateComponent = new ErrorStateComponent(title, content); - ((VBox) errorStateComponent.getCenter()).getChildren().add(new ProgressIndicator()); + errorStateComponent.contentsVBox.getChildren().add(new ProgressIndicator()); return errorStateComponent; } @@ -36,7 +37,7 @@ public static ErrorStateComponent withTextArea(String title, String content, Str TextArea textArea = new TextArea(additional); textArea.setEditable(false); - ((VBox) errorStateComponent.getCenter()).getChildren().add(textArea); + errorStateComponent.contentsVBox.getChildren().add(textArea); return errorStateComponent; } diff --git a/src/main/java/org/jabref/gui/preferences/ai/AiTab.java b/src/main/java/org/jabref/gui/preferences/ai/AiTab.java index 0ca87121bdc..fc3ea7ad4a0 100644 --- a/src/main/java/org/jabref/gui/preferences/ai/AiTab.java +++ b/src/main/java/org/jabref/gui/preferences/ai/AiTab.java @@ -40,8 +40,6 @@ public class AiTab extends AbstractPreferenceTabView implements @FXML private IntegerInputField ragMaxResultsCountTextField; @FXML private DoubleInputField ragMinScoreTextField; - private final ControlsFxVisualizer visualizer = new ControlsFxVisualizer(); - @FXML private Button 
chatModelHelp; @FXML private Button embeddingModelHelp; @FXML private Button apiBaseUrlHelp; @@ -54,6 +52,8 @@ public class AiTab extends AbstractPreferenceTabView implements @FXML private Button resetExpertSettingsButton; + private final ControlsFxVisualizer visualizer = new ControlsFxVisualizer(); + public AiTab() { ViewLoader.view(this) .root(this) diff --git a/src/main/java/org/jabref/logic/ai/models/EmbeddingModel.java b/src/main/java/org/jabref/logic/ai/models/EmbeddingModel.java index 76871a77231..e8b6abdc0ab 100644 --- a/src/main/java/org/jabref/logic/ai/models/EmbeddingModel.java +++ b/src/main/java/org/jabref/logic/ai/models/EmbeddingModel.java @@ -23,11 +23,13 @@ import dev.langchain4j.model.output.Response; /** - * Wrapper around langchain4j embedding model. + * Wrapper around langchain4j {@link dev.langchain4j.model.embedding.EmbeddingModel}. *

* This class listens to preferences changes. */ public class EmbeddingModel implements dev.langchain4j.model.embedding.EmbeddingModel, AutoCloseable { + private static final String DJL_AI_DJL_HUGGINGFACE_PYTORCH_SENTENCE_TRANSFORMERS = "djl://ai.djl.huggingface.pytorch/sentence-transformers/"; + private final AiPreferences aiPreferences; private final ExecutorService executorService = Executors.newCachedThreadPool( @@ -48,7 +50,7 @@ private void rebuild() { return; } - String modelUrl = "djl://ai.djl.huggingface.pytorch/sentence-transformers/" + aiPreferences.getEmbeddingModel().getLabel(); + String modelUrl = DJL_AI_DJL_HUGGINGFACE_PYTORCH_SENTENCE_TRANSFORMERS + aiPreferences.getEmbeddingModel().getLabel(); Criteria criteria = Criteria.builder() diff --git a/src/main/resources/tinylog.properties b/src/main/resources/tinylog.properties index d7ce45ccf80..e19b65c23f2 100644 --- a/src/main/resources/tinylog.properties +++ b/src/main/resources/tinylog.properties @@ -12,8 +12,7 @@ exception = strip: jdk.internal level@org.jabref.http.server.Server = debug -# FIXME: Remove before merging the branch - +# AI debugging #level@org.jabref.gui.entryeditor.aichattab.AiChat = trace #level@org.jabref.gui.JabRefGUI = trace #level@org.jabref.logic.ai.AiService = trace
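
Reviewer's illustration (not part of the patch): the "Data finding" step described in the ADR above can be sketched roughly as follows. This is a minimal in-memory sketch under stated assumptions, not the code in this PR, which stores vectors in `MVStore` and relies on `langchain4j`. The class `VectorSearchSketch`, the records `Chunk` and `Match`, and the parameters `maxResults` and `minScore` are hypothetical names chosen for illustration; they only loosely mirror the `ragMaxResultsCountTextField` and `ragMinScoreTextField` settings visible in `AiTab`. Cosine similarity is assumed as the closeness measure.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of similarity search over stored chunk embeddings.
class VectorSearchSketch {

    /** A stored chunk: its text plus its embedding vector. */
    record Chunk(String text, float[] embedding) {}

    /** A search hit: the chunk text and its similarity to the question. */
    record Match(String text, double score) {}

    /** Cosine similarity of two vectors of equal length (NaN if either is all zeros). */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns at most maxResults chunks scoring at least minScore, best first. */
    static List<Match> findRelevant(List<Chunk> store, float[] question,
                                    int maxResults, double minScore) {
        List<Match> matches = new ArrayList<>();
        for (Chunk chunk : store) {
            double score = cosine(chunk.embedding(), question);
            // A NaN score fails this comparison, so zero vectors are skipped.
            if (score >= minScore) {
                matches.add(new Match(chunk.text(), score));
            }
        }
        matches.sort(Comparator.comparingDouble(Match::score).reversed());
        return new ArrayList<>(matches.subList(0, Math.min(maxResults, matches.size())));
    }

    public static void main(String[] args) {
        // Toy two-dimensional embeddings; real embedding vectors have hundreds of dimensions.
        List<Chunk> store = List.of(
                new Chunk("chunk about topic A", new float[] {1f, 0f}),
                new Chunk("chunk about topic B", new float[] {0f, 1f}),
                new Chunk("chunk about A and B", new float[] {1f, 1f}));
        float[] question = {1f, 0f};
        for (Match match : findRelevant(store, question, 2, 0.5)) {
            System.out.println(match.text() + " " + match.score());
        }
    }
}
```

A linear scan like this is only reasonable for small stores; it is shown because it makes the roles of the result limit and the minimum score easy to see, not because the patch works this way.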