langchain[minor]: Updated DocugamiLoader, includes breaking changes (#13265)

This PR makes the following main changes:

1. Rewrote the DocugamiLoader so that it no longer parses the DGML XML
format internally and instead uses the `dgml-utils` library, which we
are developing separately. This is a very lightweight dependency.
2. Added MMR as a search type option for the multi-vector retriever,
similar to other retrievers. MMR is especially useful when using
Docugami for RAG, since we deal with large sets of documents in which a
few might be duplicates, and straight similarity-based search doesn't
give great results in many cases.
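
To illustrate why MMR helps when a document set contains duplicates, here is a minimal, self-contained sketch of maximal marginal relevance selection. It is not the code from this PR and not LangChain's implementation; the function names `cosine` and `mmr_select`, the toy vectors, and the `lambda_mult=0.5` default are assumptions chosen purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    Hypothetical helper: each step picks the candidate that best balances
    relevance to the query against similarity to already-selected items.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings (assumed): items 0 and 1 are exact duplicates, item 2 differs.
query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.9, 0.1], [0.8, -0.6]]

# Plain similarity ranking returns both duplicates.
top_by_similarity = sorted(
    range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True
)[:2]

# MMR penalizes the second duplicate and picks the distinct item instead.
top_by_mmr = mmr_select(query, docs, k=2)

print(top_by_similarity, top_by_mmr)
```

In this toy setup, similarity ranking spends both result slots on the duplicated item, while the MMR selection keeps one copy and fills the second slot with the distinct document.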

We are @docugami on Twitter, and I am @tjaffri.

---------

Co-authored-by: Taqi Jaffri <[email protected]>
tjaffri and Taqi Jaffri authored Nov 28, 2023
1 parent a20e8f8 commit 144710a
Showing 9 changed files with 913 additions and 583 deletions.
457 changes: 346 additions & 111 deletions docs/docs/integrations/document_loaders/docugami.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/docs/integrations/providers/docugami.mdx
@@ -8,7 +8,7 @@


```bash
-pip install lxml
+pip install dgml-utils
```

## Document Loader
39 changes: 36 additions & 3 deletions docs/docs/modules/data_connection/retrievers/multi_vector.ipynb
@@ -143,7 +143,7 @@
{
"data": {
"text/plain": [
-"Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '10e9cbc0-4ba5-4d79-a09b-c033d1ba7b01', 'source': '../../state_of_the_union.txt'})"
+"Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '455205f7-bb7d-4c36-b442-d1d6f9f701ed', 'source': '../../state_of_the_union.txt'})"
]
},
"execution_count": 8,
@@ -165,7 +165,7 @@
{
"data": {
"text/plain": [
-"9874"
+"9875"
]
},
"execution_count": 9,
@@ -178,6 +178,39 @@
"len(retriever.get_relevant_documents(\"justice breyer\")[0].page_content)"
]
},
+{
+"cell_type": "markdown",
+"id": "cdef8339-f9fa-4b3b-955f-ad9dbdf2734f",
+"metadata": {},
+"source": [
+"The default search type the retriever performs on the vector database is a similarity search. LangChain Vector Stores also support searching via [Max Marginal Relevance](https://api.python.langchain.com/en/latest/schema/langchain.schema.vectorstore.VectorStore.html#langchain.schema.vectorstore.VectorStore.max_marginal_relevance_search) so if you want this instead you can just set the `search_type` property as follows:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 15,
+"id": "36739460-a737-4a8e-b70f-50bf8c8eaae7",
+"metadata": {},
+"outputs": [
+{
+"data": {
+"text/plain": [
+"9875"
+]
+},
+"execution_count": 15,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"from langchain.retrievers.multi_vector import SearchType\n",
+"\n",
+"retriever.search_type = SearchType.mmr\n",
+"\n",
+"len(retriever.get_relevant_documents(\"justice breyer\")[0].page_content)"
+]
+},
{
"cell_type": "markdown",
"id": "d6a7ae0d",
@@ -576,7 +609,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.10.1"
+"version": "3.9.16"
}
},
"nbformat": 4,