diff --git a/docs/how-to/document-search/search_documents.md b/docs/how-to/document-search/search_documents.md
new file mode 100644
index 00000000..c60f09d8
--- /dev/null
+++ b/docs/how-to/document-search/search_documents.md
@@ -0,0 +1,118 @@
+# How-To: Search Documents
+
+The `ragbits-document-search` package comes with all the functionality required to perform document search. The whole process can be divided into three steps:
+1. Load documents
+2. Process documents, embed them, and store them in the vector database
+3. Run the search
+
+This guide walks you through each of these steps and explains the details. Let's start with a minimal example to get the main idea:
+```python
+import asyncio
+from pathlib import Path
+
+from ragbits.core.embeddings.litellm import LiteLLMEmbeddings
+from ragbits.core.vector_stores.in_memory import InMemoryVectorStore
+from ragbits.document_search import DocumentSearch
+from ragbits.document_search.documents.document import DocumentMeta
+from ragbits.document_search.documents.sources import GCSSource
+
+async def main() -> None:
+    # Load documents (there are multiple possible sources)
+    documents = [
+        DocumentMeta.from_local_path(Path("")),
+        DocumentMeta.create_text_document_from_literal("Test document"),
+        DocumentMeta.from_source(GCSSource(bucket="", object_name=""))
+    ]
+
+    embedder = LiteLLMEmbeddings()
+    vector_store = InMemoryVectorStore()
+    document_search = DocumentSearch(
+        embedder=embedder,
+        vector_store=vector_store,
+    )
+
+    # Ingest documents - here they are processed, embedded and stored
+    await document_search.ingest(documents)
+
+    # Actual search
+    results = await document_search.search("I'm boiling my water and I need a joke")
+    print(results)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+## Documents loading
+Before doing any search, we need some documents to build our knowledge base. Ragbits offers a handy `Document` class that stores all the information needed for document loading.
+Objects of this class are usually instantiated using the `DocumentMeta` helper class, which supports loading files from local storage, GCS, or HuggingFace.
+You can add support for custom sources by extending the `Source` class and implementing its abstract methods:
+```python
+from pathlib import Path
+
+from ragbits.document_search.documents.sources import Source
+
+class CustomSource(Source):
+    @property
+    def id(self) -> str:
+        pass
+
+    async def fetch(self) -> Path:
+        pass
+```
+
+## Processing, embedding and storing
+With the documents loaded, we can proceed with the pipeline. The next step covers processing, embedding, and storing. Embeddings and Vector Stores have their own sections in the documentation,
+so here we focus on processing.
+
+Before a document can be ingested into the system, it needs to be processed into a collection of elements that the system supports. Right now there are two supported elements:
+`TextElement` and `ImageElement`. You can introduce your own elements by extending the `Element` class.
+
+Depending on the type of the document, different `providers` work under the hood to return a list of supported elements. Ragbits relies mainly on the [Unstructured](https://unstructured.io/)
+library, which supports parsing and chunking of the most common document types (e.g. pdf, md, doc, jpg). You can specify a mapping of file type to provider when creating a document search instance:
+```python
+from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
+from ragbits.document_search.documents.document import DocumentType
+from ragbits.document_search.ingestion.providers.unstructured.default import UnstructuredDefaultProvider
+
+document_search = DocumentSearch(
+    embedder=embedder,
+    vector_store=vector_store,
+    document_processor_router=DocumentProcessorRouter({DocumentType.TXT: UnstructuredDefaultProvider()})
+)
+```
+
+To implement a new provider, extend the `BaseProvider` class:
+```python
+from ragbits.document_search.documents.document import DocumentMeta, DocumentType
+from ragbits.document_search.documents.element import Element
+from ragbits.document_search.ingestion.providers.base import BaseProvider
+
+
+class CustomProvider(BaseProvider):
+    SUPPORTED_DOCUMENT_TYPES = { DocumentType.TXT }  # provide supported document types
+
+    async def process(self, document_meta: DocumentMeta) -> list[Element]:
+        pass
+```
+
+## Search
+Once the documents are ingested and stored, we can move on to search. It is straightforward: you only need to call the `search()` method.
+The response is a sequence of the elements most similar to the provided query.
+
+## Advanced configuration
+The `DocumentSearch` class can also be created from a config that describes the complete setup:
+```python
+config = {
+    "embedder": {...},
+    "vector_store": {...},
+    "reranker": {...},
+    "providers": {...},
+    "rephraser": {...},
+}
+
+document_search = DocumentSearch.from_config(config)
+```
+For a complete example, please refer to `examples/document-search/from_config.py`.
+
+To improve your search results, read more about how to adjust the [QueryRephraser](use_rephraser.md) or [Reranker](use_reranker.md).
\ No newline at end of file
diff --git a/docs/how-to/document-search/use_rephraser.md b/docs/how-to/document-search/use_rephraser.md
new file mode 100644
index 00000000..67498e72
--- /dev/null
+++ b/docs/how-to/document-search/use_rephraser.md
@@ -0,0 +1,68 @@
+# How-To: Use Rephraser
+`ragbits-document-search` contains a `QueryRephraser` module that can be used to create an additional query that
+improves the original user query (fixes typos, handles abbreviations, etc.). Both queries are then sent to the document search
+module, which can use them to find better matches.
+
+This guide shows how to use `QueryRephraser` and how to create a custom implementation.
+
+## LLM rephraser usage
+To use a rephraser within the retrieval pipeline, you need to provide it during `DocumentSearch` construction. The following example uses
+`LLMQueryRephraser` and the default `QueryRephraserPrompt`.
+```python
+import asyncio
+from ragbits.core.llms.litellm import LiteLLM
+from ragbits.document_search import DocumentSearch
+from ragbits.document_search.retrieval.rephrasers.llm import LLMQueryRephraser
+from ragbits.document_search.retrieval.rephrasers.prompts import QueryRephraserPrompt
+
+async def main():
+    document_search = DocumentSearch(
+        query_rephraser=LLMQueryRephraser(LiteLLM("gpt-3.5-turbo"), QueryRephraserPrompt),
+        ...
+    )
+    results = await document_search.search("")
+
+asyncio.run(main())
+```
+
+The next example shows how to use the same rephraser as an independent component:
+
+```python
+import asyncio
+from ragbits.document_search.retrieval.rephrasers.llm import LLMQueryRephraser
+from ragbits.document_search.retrieval.rephrasers.prompts import QueryRephraserPrompt
+from ragbits.core.llms.litellm import LiteLLM
+
+
+async def main():
+    rephraser = LLMQueryRephraser(LiteLLM("gpt-3.5-turbo"), QueryRephraserPrompt)
+    rephrased = await rephraser.rephrase("Wht tim iz id?")
+    print(rephrased)
+
+asyncio.run(main())
+```
+The console should print:
+```text
+['What time is it?']
+```
+
+To change the prompt, create your own class in the following way:
+```python
+from ragbits.core.prompt import Prompt
+from ragbits.document_search.retrieval.rephrasers.llm import QueryRephraserInput
+
+class QueryRephraserPrompt(Prompt[QueryRephraserInput, str]):
+    user_prompt = "{{ query }}"
+    system_prompt = ("")
+```
+You should only change the `system_prompt`, as the `user_prompt` will contain the query passed to `DocumentSearch.search()` later.
+
+## Custom rephraser
+It is possible to create a custom rephraser by extending the base class:
+```python
+from ragbits.document_search.retrieval.rephrasers.base import QueryRephraser
+
+class CustomRephraser(QueryRephraser):
+    async def rephrase(self, query: str) -> list[str]:
+        pass
+```
\ No newline at end of file
diff --git a/docs/how-to/document-search/use_reranker.md b/docs/how-to/document-search/use_reranker.md
new file mode 100644
index 00000000..b494ed66
--- /dev/null
+++ b/docs/how-to/document-search/use_reranker.md
@@ -0,0 +1,90 @@
+# How-To: Use Reranker
+`ragbits-document-search` contains a `Reranker` module that can be used to select the most relevant and highest-quality information from a set of retrieved documents.
+
+This guide shows how to use `LiteLLMReranker` and how to create a custom implementation.
+
+
+## LLM Reranker
+`LiteLLMReranker` is based on [litellm.rerank()](https://docs.litellm.ai/docs/rerank), which supports three providers: Cohere, Azure AI, and Together AI.
+You will need to set the appropriate API key to use the reranking functionality.
+
+To use `LiteLLMReranker` within the retrieval pipeline, provide it as an argument to `DocumentSearch`.
+```python
+import os
+from ragbits.document_search.retrieval.rerankers.litellm import LiteLLMReranker
+
+os.environ["COHERE_API_KEY"] = ""
+
+document_search = DocumentSearch(
+    reranker=LiteLLMReranker("cohere/rerank-english-v3.0"),
+    ...
+)
+```
+
+The next example shows basic usage of the same reranker as an independent component:
+
+```python
+import asyncio
+import os
+from ragbits.document_search.retrieval.rerankers.litellm import LiteLLMReranker
+from ragbits.document_search.documents.element import TextElement
+from ragbits.document_search.documents.document import DocumentMeta
+
+os.environ["COHERE_API_KEY"] = ""
+
+
+def create_text_element(text: str) -> TextElement:
+    document_meta = DocumentMeta.create_text_document_from_literal(content=text)
+    text_element = TextElement(document_meta=document_meta, content=text)
+    return text_element
+
+
+async def main():
+    reranker = LiteLLMReranker(model="cohere/rerank-english-v3.0")
+    text_elements = [
+        create_text_element(
+            text="The artificial inteligence development is a milestone for global information accesibility"
+        ),
+        create_text_element(text="The redpill will show you the true nature of things"),
+        create_text_element(text="The bluepill will make you stay in the state of ignorance"),
+    ]
+    query = "Take the pill and follow the rabbit!"
+    ranked = await reranker.rerank(elements=text_elements, query=query)
+    for element in ranked:
+        print(element.content + "\n")
+
+
+asyncio.run(main())
+```
+
+The console should print the contents of the ranked elements in order of their relevance to the query, as determined by the model.
+
+```text
+The redpill will show you the true nature of things
+
+The bluepill will make you stay in the state of ignorance
+
+The artificial inteligence development is a milestone for global information accesibility
+```
+
+## Custom Reranker
+To create a custom reranker, extend the `Reranker` class:
+```python
+from collections.abc import Sequence
+
+from ragbits.document_search.retrieval.rerankers.base import Reranker, RerankerOptions
+from ragbits.document_search.documents.element import Element
+
+class CustomReranker(Reranker):
+    async def rerank(
+        self,
+        elements: Sequence[Element],
+        query: str,
+        options: RerankerOptions | None = None,
+    ) -> Sequence[Element]:
+        pass
+
+    @classmethod
+    def from_config(cls, config: dict) -> "CustomReranker":
+        pass
+```
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index ae87765f..8b6783f3 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -13,6 +13,9 @@ nav:
     - Document Search:
       - how-to/document_search/async_processing.md
       - how-to/document_search/create_custom_execution_strategy.md
+      - how-to/document-search/search_documents.md
+      - how-to/document-search/use_rephraser.md
+      - how-to/document-search/use_reranker.md
   - API Reference:
     - Core:
       - api_reference/core/prompt.md