Questions? Just message us on Discord or create an issue in GitHub. We're happy to help live!
Table of Contents:
- Enterprise Knowledge Retrieval
- Overview
- Before you begin
- Deploy the starter kit GUI
- Use the starter kit
- Customizing the starter kit
- Third-party tools and data sources
This AI Starter Kit is an example of a semantic search workflow. You send your PDF or TXT file to the SambaNova platform and get answers to questions about the document's content. The Kit includes:
- A configurable SambaNova Cloud or SambaStudio connector. The connector generates answers from a deployed model.
- A configurable integration with a third-party vector database.
- An implementation of a semantic search workflow using LangChain LCEL.
- Prompt construction strategies.
This sample is ready-to-use. We provide:
- Instructions for setup with SambaNova Cloud or SambaStudio.
- Instructions for running the model as is.
- Instructions for customizing the model.
You have to set up your environment before you can run or customize the starter kit.
Clone the starter kit repo.
git clone https://github.com/sambanova/ai-starter-kit.git
The next step is to set up your environment variables to use one of the inference models available from SambaNova. You can obtain a free API key through SambaNova Cloud. Alternatively, if you are a current SambaNova customer, you can deploy your models using SambaStudio.
- SambaNova Cloud (Option 1): Follow the instructions here to set up your environment variables. Then, in the config file, set the type variable in llm_info to "sncloud" and set the model config depending on the model you want to use.
- SambaStudio (Option 2): Follow the instructions here to set up your endpoint and environment variables. Then, in the config file, set the type variable in llm_info to "sambastudio", and set the bundle and model configs if you are using a bundle endpoint.
You have the following options to set up your embedding model:
- CPU embedding model (Option 1): In the config file, set the variable type in embedding_model to "cpu".
- SambaStudio embedding model (Option 2): To increase inference speed, you can use a SambaStudio embedding model endpoint instead of using the default (CPU) Hugging Face embedding. Follow the instructions here to set up your endpoint and environment variables. Then, in the config file, set the variable type in embedding_model to "sambastudio", and set the configs batch_size, bundle, and select_expert according to your SambaStudio endpoint.
Choose your vector database from the available integrations to power your RAG workflow. In the config file, under the retrieval section, set the variable db_type to your choice. The following open-source options are supported:
- Chroma (default): Specify this option by setting "db_type": "chroma".
- Milvus by Zilliz: Specify this option by setting "db_type": "milvus".
- If you are using Windows, make sure your system has the Microsoft Visual C++ Redistributable installed. You can install it from Microsoft Visual C++ Build Tools; be sure to check all boxes in the C++ section. (Compatible versions: 2015, 2017, 2019, or 2022.)
We recommend that you run the starter kit in a virtual environment or use a container. We also recommend using Python >= 3.10 and < 3.12.
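If you want to confirm that an interpreter falls inside the recommended range before installing anything, a small check like the following can help. This is only a sketch: the function name is_supported is an illustrative assumption and is not part of the starter kit.

```python
import sys

# Bounds taken from the recommendation above: Python >= 3.10 and < 3.12.
MIN_VERSION = (3, 10)
MAX_VERSION_EXCLUSIVE = (3, 12)


def is_supported(version=None):
    """Return True if (major, minor) falls inside the recommended range.

    `version` is any tuple starting with (major, minor); it defaults to the
    running interpreter. The function name is hypothetical.
    """
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "outside the recommended range"
    print(f"Python {current} is {status}")
```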
If you want to use a virtualenv or conda environment:
- Create and activate a virtual environment, then install the requirements.
cd ai-starter-kit/enterprise_knowledge_retriever
python3 -m venv enterprise_knowledge_env
source enterprise_knowledge_env/bin/activate
pip install -r requirements.txt
- Run the following command:
streamlit run streamlit/app.py --browser.gatherUsageStats false
After deploying the starter kit you see the following user interface:
NOTE: If you are deploying the Docker container on Windows, be sure to open the Docker Desktop application first.
To run the starter kit with Docker, run the following command:
docker-compose up --build
You will be prompted to open the link (http://localhost:8501/) in your browser, where you will see the same Streamlit page as above.
After you've deployed the GUI, you can use the starter kit. Follow these steps:
- In the Pick a datasource pane, either drag and drop files or browse to select them. The data source can be a series of PDF files or a Chroma vectorstore.
- Click Process to process all loaded PDFs. This will create a vectorstore in memory, which you can optionally save to disk. Note: This step may take some time, particularly if you are processing large documents or using CPU-based embeddings.
- In the main panel, you can ask questions about the PDF data.
This pipeline uses the AI starter kit as is, with ingestion, retrieval, and Q&A workflows. More details about each workflow are provided below:
Ingestion workflow
This workflow, included with this starter kit, is an example of parsing and indexing data for subsequent Q&A. The steps are:
- Document parsing: Python packages like PyMuPDF or unstructured are used to extract text from file documents. On the LangChain website, multiple integrations for text extraction from multiple file types are available. Depending on the quality and the format of the files, this step might require customization for different use cases. This kit uses the parser util in the background for the document parsing step, which leverages either PyMuPDF or the unstructured module to parse the documents.
- Split data: After the data has been parsed and its content extracted, it is necessary to split the data into chunks of text to be embedded and stored in a vector database. The size of the text chunks depends on the context (sequence) length offered by the model. Generally, larger context lengths result in better performance. The method used to split text also impacts performance; for instance, ensuring there are no word or sentence breaks is crucial. The downloaded data is split using the parser util, which leverages either PyMuPDF or the unstructured module to split the parsed documents into chunks.
- Embed data: For each chunk of text from the previous step, we use an embeddings model to create a vector representation of the text. These embeddings are then used for storing and retrieving the most relevant content given a user's query. The split text is embedded using HuggingFaceInstructEmbeddings.
  For more information about what embeddings are, click here.
- Store embeddings: Embeddings for each chunk, along with content and relevant metadata (such as source documents) are stored in a vector database where the embedding acts as the index. In this template, we store information with each entry, which can be modified to suit your needs. Several vector database options are available, each with their own pros and cons. This starter kit is set up to use Chroma as the vector database because it is a free, open-source option with straightforward setup, but it can easily be updated to use another if desired. In terms of metadata, filename and page are also attached to the embeddings, which are extracted during parsing of the PDF documents.
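The embed and store steps above can be illustrated with a small, dependency-free sketch. It is not the kit's implementation: toy_embed is a deterministic stand-in for a real embedding model such as HuggingFaceInstructEmbeddings (it carries no semantic meaning), and the in-memory list stands in for a vector database such as Chroma. The function names and the entry layout are assumptions; only the filename and page metadata fields come from the description above.

```python
import hashlib
import math


def toy_embed(text, dims=8):
    """Hypothetical stand-in for an embedding model (NOT semantic).

    Hashes each word into one of `dims` buckets and L2-normalizes the counts.
    """
    vector = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vector[digest[0] % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def ingest(page_chunks, filename):
    """Embed each chunk and store it with its content and metadata.

    `page_chunks` is a list of (page_number, chunk_text) pairs produced by
    whatever parsing and splitting step ran before this one.
    """
    store = []  # stand-in for a vector database collection
    for page_number, chunk_text in page_chunks:
        store.append({
            "embedding": toy_embed(chunk_text),
            "content": chunk_text,
            "metadata": {"filename": filename, "page": page_number},
        })
    return store
```

For example, ingest([(1, "first chunk"), (2, "second chunk")], "report.pdf") returns two entries, each carrying its vector, its text, and the filename and page it came from.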
Retrieval workflow
This workflow is an example of leveraging data stored in a vector database along with a large language model to enable retrieval-based Q&A from your data. The steps are:
- Embed query: The first step is to convert a user-submitted query to a common representation (an embedding) for subsequent use in identifying the most relevant stored content. Use the same embedding model for the query as for the stored documents. In this starter kit, the query text is embedded using HuggingFaceInstructEmbeddings, which is the same embedding model used in the ingestion workflow.
- Retrieve relevant content: Next, we use the embedding representation of the query to make a retrieval request to the vector database, which in turn returns the relevant entries (content) it holds. Thus, the vector database also acts as a retriever for fetching relevant information.
  For more information about embeddings and their retrieval, click here.
- Rerank retrieved content: After retrieving a specified number of relevant chunks of information, a reranker model can be set to rerank the retrieved passages in order of relevance to the user query. Then, the top N documents with the highest relevance scores are selected and passed to the QA chain as context.
  For more information about retrieval augmented generation with LangChain, click here.
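To make the retrieval step more concrete, here is a minimal sketch of similarity search over stored embeddings. It assumes cosine similarity as the score and a plain list of entries as the store; the actual scoring used by the kit's vector database may differ. The function names are hypothetical, and the k and score_threshold parameters mirror the k_retrieved_documents and score_threshold settings only by analogy.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, entries, k, score_threshold):
    """Return up to `k` (score, entry) pairs, best first.

    Entries scoring below `score_threshold` are dropped before the cut.
    Each entry is a dict with at least an "embedding" key (an assumption).
    """
    scored = [
        (cosine_similarity(query_embedding, entry["embedding"]), entry)
        for entry in entries
    ]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]
```

The query must be embedded with the same model as the stored chunks; otherwise the similarity scores are not comparable.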
Q&A workflow
After the relevant information is retrieved, the content is sent to a SambaNova LLM to generate a final response to the user query.
Before being sent to the LLM, the user's query is combined with the retrieved content along with instructions to form the prompt. This process involves prompt engineering, and is an important part of ensuring quality output. In this AI starter kit, customized prompts are provided to the LLM to improve the quality of response for this use case.
To learn more about prompt engineering, click here.
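As a rough illustration of how the query, the retrieved content, and the instructions come together, the sketch below assembles a prompt string by hand. It is an assumption-laden simplification: build_prompt and the document layout are hypothetical, and the kit itself loads its template from a prompt file rather than hard-coding it like this.

```python
def build_prompt(question, documents):
    """Combine instructions, the user question, and retrieved context.

    `documents` is a list of dicts with "content" and "metadata" keys
    (a hypothetical layout). Each context piece is labeled with its source.
    """
    context_parts = []
    for doc in documents:
        source = doc["metadata"].get("filename", "unknown")
        context_parts.append(f"Source: {source}\n{doc['content']}")
    context = "\n\n".join(context_parts)
    return (
        "Use the following pieces of retrieved context to answer the question.\n\n"
        f"Question: {question}\n"
        f"Context: {context}\n"
        "Answer:"
    )
```

Labeling each piece of context with its source lets the instructions refer to sources explicitly, for example when a question names a specific document.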
You can further customize the starter kit based on the use case.
Different packages are available to extract text from different file documents. They can be broadly categorized as:
- OCR-based: pytesseract, paddleOCR, unstructured
- Non-OCR based: pymupdf, pypdf
Most of these packages have easy integrations with the LangChain library. You can find examples of the usage of these loaders in the Data extraction starter kit.
This enterprise knowledge retriever kit uses either PyMuPDF or a custom implementation of the unstructured loader. This can be configured in the config.yaml file:
- If pdf_only_mode is set to True, then PyMuPDF is used as the data loader. Please note that in this case, only PDF documents are supported.
- If pdf_only_mode is set to False, then the unstructured loader is used, which works well with all file types. Please note that in this case, you need to install the following system dependencies if they are not already available on your system, for example, using brew install for Mac. Depending on what document types you're parsing, you may not need all of these:
  - libmagic-dev (filetype detection)
  - poppler (images and PDFs)
  - tesseract-ocr (images and PDFs)
  - qpdf (PDFs)
  - libreoffice (MS Office docs)
  - pandoc (EPUBs)
You can also modify several parameters in the loading strategies by changing the ../utils/parsing/config.yaml file, see more here.
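The effect of pdf_only_mode can be pictured as a simple dispatch on the flag and the file extension. The sketch below is illustrative only: choose_loader is a hypothetical helper, it returns loader names as strings rather than calling PyMuPDF or unstructured, and the kit's parser util may make this decision differently.

```python
from pathlib import Path


def choose_loader(file_path, pdf_only_mode):
    """Return the name of the loader that would handle `file_path`.

    Hypothetical helper: with pdf_only_mode=True only PDFs are accepted
    and PyMuPDF is chosen; otherwise the unstructured loader is chosen.
    """
    suffix = Path(file_path).suffix.lower()
    if pdf_only_mode:
        if suffix != ".pdf":
            raise ValueError(
                f"pdf_only_mode is enabled, but {file_path!r} is not a PDF"
            )
        return "pymupdf"
    return "unstructured"
```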
You can experiment with different ways of splitting the data, such as splitting by tokens or using context-aware splitting for code or markdown files. LangChain provides several examples of different kinds of splitting; see more here.
The chunking section of the parser utils config, which is used in this starter kit, can be further customized using the chunk_max_characters and chunk_overlap parameters. For LLMs with a long sequence length, use a larger value of chunk_max_characters to provide the LLM with broader context and improve performance. The chunk_overlap parameter is used to maintain continuity between different chunks.
You can modify this and other parameters in the chunking config in the ../utils/parsing/config.yaml; see more here.
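To see how chunk_max_characters and chunk_overlap interact, consider this simplified character-based splitter. It is a sketch under stated assumptions, not the parser util's algorithm: split_text is a hypothetical function, and unlike a production splitter it cuts at fixed character offsets, so it can break words and sentences.

```python
def split_text(text, chunk_max_characters, chunk_overlap):
    """Split `text` into chunks of at most `chunk_max_characters` characters.

    Consecutive chunks share `chunk_overlap` characters, so content near a
    chunk boundary appears in both neighbors.
    """
    if chunk_max_characters <= 0:
        raise ValueError("chunk_max_characters must be positive")
    if not 0 <= chunk_overlap < chunk_max_characters:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_max_characters")

    step = chunk_max_characters - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_max_characters])
        # Stop once a chunk reaches the end of the text.
        if start + chunk_max_characters >= len(text):
            break
    return chunks
```

With chunk_max_characters=4 and chunk_overlap=1, the text "abcdefghij" is split into "abcd", "defg", and "ghij": each chunk starts with the last character of the previous one.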
Several open-source embedding models are available on Hugging Face. This leaderboard ranks these models based on the Massive Text Embedding Benchmark (MTEB). A number of these models, such as e5-large-v2 and e5-mistral-7b-instruct, are available on SambaStudio and can be further fine-tuned on specific datasets to improve performance.
To change the embedding model, do the following:
- If using CPU embedding (i.e., type in embedding_model is set to "cpu" in the config.yaml file), e5-large-v2 from HuggingFaceInstruct is used by default. If you want to use another model, you will need to manually modify the EMBEDDING_MODEL variable and the load_embedding_model() function in the api_gateway.py.
- If using SambaStudio embedding (i.e., type in embedding_model is set to "sambastudio" in the config.yaml file), you will need to change the SambaStudio endpoint and/or the configs batch_size, bundle, and select_expert in the config file.
The template can be customized to use different vector databases to store the embeddings generated by the embedding model. The LangChain vector stores documentation provides a broad collection of vector stores that can be easily integrated.
By default, we use Chroma. You can change the vector store by setting db_type in the create_vector_store() function in document_retrieval.py.
A wide collection of retriever options is available. In this starter kit, the vector store is used as a retriever, but it can be enhanced and customized, as shown in some of the examples here.
You can do this modification in the config.yaml file:
"k_retrieved_documents": 15
"score_threshold": 0.2
"rerank": False
"reranker": 'BAAI/bge-reranker-large'
"final_k_retrieved_documents": 5
There, you can select the final number of retrieved documents and decide whether to use the reranker:
- If rerank is set to False, then no reranker is used, and final_k_retrieved_documents represents the number of documents retrieved by the retriever.
- If rerank is set to True, k_retrieved_documents represents the number of documents first retrieved by the retriever, and final_k_retrieved_documents represents the final number of documents after reranking.
The implementation can be customized by modifying the get_qa_retrieval_chain() function in the document_retrieval.py file.
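The interplay between rerank, k_retrieved_documents, and final_k_retrieved_documents can be sketched as follows. This is not the code in get_qa_retrieval_chain(): select_context is a hypothetical function, and the retriever and reranker are passed in as plain callables so the example runs without any model.

```python
def select_context(retriever, query, config, reranker_score=None):
    """Pick the documents that would be passed to the LLM as context.

    `retriever(query, k)` returns up to k documents; `reranker_score(query, doc)`
    returns a relevance score (higher is better). Both are stand-ins.
    """
    if not config["rerank"]:
        # No reranker: the retriever directly returns the final documents.
        return retriever(query, config["final_k_retrieved_documents"])

    if reranker_score is None:
        raise ValueError("rerank is enabled but no reranker_score was given")

    # Retrieve a wider candidate set, then keep the best after reranking.
    candidates = retriever(query, config["k_retrieved_documents"])
    ranked = sorted(
        candidates,
        key=lambda doc: reranker_score(query, doc),
        reverse=True,
    )
    return ranked[:config["final_k_retrieved_documents"]]
```

Retrieving more candidates than are finally used gives the reranker room to promote passages that the vector search ranked lower.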
Certain customizations to the LLM itself can affect the starter kit performance. To modify the parameters for calling the model, make changes to the config file. You can also set the values of temperature and max_tokens_to_generate in that file.
Prompting has a significant effect on the quality of LLM responses, and prompts can be further customized to improve it. For example, this starter kit uses the following prompt template to generate a response from the LLM, where question is the user query and context contains the documents returned by the retriever.
template: |
<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a knowledge base assistant chatbot powered by Sambanova's AI chip accelerator, designed to answer questions based on user-uploaded documents.
Use the following pieces of retrieved context to answer the question. Each piece of context includes the Source for reference. If the question references a specific source, then filter out that source and give a response based on that source.
If the answer is not in the context, say: "This information isn't in my current knowledge base." Then, suggest a related topic you can discuss based on the available context.
Maintain a professional yet conversational tone. Do not use images or emojis in your answer.
Prioritize accuracy and only provide information directly supported by the context. <|eot_id|><|start_header_id|>user<|end_header_id|>
Question: {question}
Context: {context}
\n ------- \n
Answer: <|eot_id|><|start_header_id|>assistant<|end_header_id|>
You can make modifications to the prompt template in the following file:
file: prompts/qa_prompt.yaml
All the packages/tools are listed in the requirements.txt file in the project directory.