Replies: 10 comments 3 replies
-
Same here. The snippet below takes a whopping 2s to run:

from langchain_community.embeddings import OllamaEmbeddings
from langchain.docstore.document import Document

embeddings = OllamaEmbeddings(base_url="http://localhost:11434", model="llama2", show_progress=True)
some_doc = Document(page_content="This is some text")
print(embeddings.embed_documents([some_doc.page_content]))
-
Same here. Does anyone know how to solve this?
-
Same here. I'm trying to write a simple chat application with agents, and each query takes ~350s to run.
The model responds almost immediately, but when I use the Python package for Ollama via from langchain_community.llms.ollama import Ollama the performance takes a BIG hit; tbh I have no idea what's causing this.
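One rough way to narrow it down is to time the same prompt through the raw ollama Python package and through the LangChain wrapper; this is just a sketch, and the model name is only an example:

import time
import ollama
from langchain_community.llms.ollama import Ollama

prompt = "Why is the sky blue?"

# Direct call through the ollama Python package
start = time.perf_counter()
ollama.generate(model="llama2", prompt=prompt)
print(f"ollama package: {time.perf_counter() - start:.1f}s")

# Same prompt through the LangChain wrapper
start = time.perf_counter()
Ollama(model="llama2").invoke(prompt)
print(f"LangChain wrapper: {time.perf_counter() - start:.1f}s")

# Note: the first call may include model load time; run each twice for a fairer comparison.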
-
Same problem here.

from langchain_community.document_loaders import CSVLoader
from langchain.docstore.document import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import re
loader = CSVLoader(
"file.csv",
encoding="utf-8",
csv_args={
"delimiter": ",",
"quotechar": '"',
"fieldnames": [""],
}
)
documents = loader.load()
docs = [Document(page_content=re.sub("[\\s]+", " ", re.sub("[\\n\\r]", " ", document.page_content[2:]))) for document in documents]
modelPath = "sentence-transformers/all-MiniLM-l6-v2"
model_kwargs = {'device':'cuda'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
model_name=modelPath,
cache_folder="embeddings",
model_kwargs=model_kwargs,
encode_kwargs=encode_kwargs,
show_progress=True
)
db = FAISS.load_local("faiss_data", embeddings, allow_dangerous_deserialization=True)
retriever = db.as_retriever(search_kwargs={"k": 4}, search_type="similarity")
template = """
[INST]
Use the following context elements for answering the final question.
If you don't know the answer, say that you don't know it, don't try to guess the response.
{context}
Question: {question}
Answer:
[/INST]"""
custom_rag_prompt = ChatPromptTemplate.from_template(template)
model = "mistral:7b"
llm = ChatOllama(model=model, num_predict=512, num_ctx=3072, temperature=0.2, num_gpu=1)
rag_chain = (
{"context": retriever , "question": RunnablePassthrough()}
| custom_rag_prompt
| llm
| StrOutputParser()
)
rag_chain.invoke("Who left the company in the last week?")

The rag_chain took 1.30 minutes to answer the question, and even without context it takes at least 1 minute. With ollama run mistral:7b it answers in at most 5 seconds, so I'm thinking that LangChain is causing a bottleneck between the Python code and the Ollama backend (a quick way to check this is sketched below).
Thank you very much!
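To see where the time actually goes, one option is to time the pieces of the chain separately, reusing the retriever and llm defined above; a rough sketch:

import time

question = "Who left the company in the last week?"

# Time the retriever on its own
start = time.perf_counter()
docs = retriever.invoke(question)
print(f"retriever: {time.perf_counter() - start:.1f}s for {len(docs)} docs")

# Time the chat model on its own, without any retrieved context
start = time.perf_counter()
llm.invoke(question)
print(f"llm: {time.perf_counter() - start:.1f}s")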
-
Langchain Ollama use pure
-
For those interested: in https://github.com/MaxiBoether/langchain-ollama-package I have prototyped using the ollama package for e.g. the chat model as well, which massively speeds up inference. I need to clean this up a bit (tbh, Claude Opus did most of the work :D) and might submit a PR later on.
-
Sorry for the late response, I was busy with daily work. Yes, technically the current implementation is very poor. It looks like @MaxiBoether has done some interesting work.
-
The same issue stands for LangChain JS. I have tried several models; all of them answer almost instantly in the Ollama CLI, but LangChain takes at least a minute to deliver the answer for the same question I asked in the CLI. Please suggest if there is any fix available.
-
You can use streaming so you don't have to wait for the complete response to be generated before you see any output.
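For example, a minimal streaming sketch with ChatOllama (the model name is just a placeholder) prints tokens as they arrive instead of waiting for the whole answer:

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="mistral:7b")
# Each chunk is a message chunk; print its text as soon as it arrives
for chunk in llm.stream("Explain streaming in one paragraph."):
    print(chunk.content, end="", flush=True)
print()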
-
Discussing the Issue
Hi, nice to meet you! My name is Diego Cancino, and I faced a similar problem. Let me give some context: when I used OllamaLLM from langchain_ollama, I ran into the same issue of excessive slowness. The hardware I was using included an NVIDIA RTX 3070 graphics card, so I knew GPU performance wasn't the bottleneck. Frustration drove me to dig deeper into the issue. When you use Ollama with a freshly downloaded model via the ollama pull [model_name] command, the server picks configurations to run the model efficiently. However, using OllamaLLM introduces configurations on the Ollama server that limit the response speed. You can imagine my confusion when running my Python script and seeing responses much slower than when using the Ollama container directly. For reference, I ran the Ollama container with the following command:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Eventually, I discovered that there is a parameter in OllamaLLM (and in related classes like ChatOllama) called num_gpu. While some might interpret this as the total number of GPUs to use, desperation led me to experiment. Since an RTX 3070 has 5,888 CUDA cores, I hypothesized that setting num_gpu to a value close to that number might help. After enabling CUDA optimizations with PyTorch and setting num_gpu to a value close to 5,888, I observed significantly faster responses compared to leaving it at 1. For this reason, I suggest trying this configuration.
Problematic Code
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from arq_ai.faiss_index import FAISSIndexer
from uuid import UUID
from typing import AsyncGenerator
class FastAssistantChat:
    TEMPLATE_BENEFITS = """
    You are an assistant that provides concise benefits by summarizing context to add value to the salesperson during the sales process.
    The output should be structured as follows:
    Product Title
    Benefits
    - benefit 1
    - benefit 2
    If no context is provided, say: "I don't have product information" and do not attempt to create new information.
    {context}
    """

    def __init__(self):
        self.__chat_model = OllamaLLM(
            model="phi3:mini",
            temperature=0.2
        )
        self.__rag_custom_prompt = PromptTemplate.from_template(self.TEMPLATE_BENEFITS)
        self.__rag_chain = None

    def generate_chain(self):
        self.__rag_chain = (
            {"context": RunnablePassthrough()}
            | self.__rag_custom_prompt
            | self.__chat_model
        )

    async def get_product_benefits(self, id_product: UUID, indexer: FAISSIndexer) -> str:
        if not self.__rag_chain:
            raise RuntimeError("The RAG chain isn't initialized")
        context = await indexer.query_benefits_sale(id_product)
        if not context:
            return "No product information found"
        result = await self.__rag_chain.ainvoke({'context': context})
        return result

    async def get_benefits_only_product_stream(self, id_product: UUID, indexer: FAISSIndexer) -> AsyncGenerator[str, None]:
        if not self.__rag_chain:
            raise RuntimeError("The RAG chain isn't initialized")
        context = await indexer.query_benefits_sale(id_product)
        if not context:
            yield "No product information found"
            return
        async for chunk in self.__rag_chain.astream({'context': context}):
            print(chunk, end="", flush=True)
            yield chunk

Refactored Constructor
I made some changes to the class constructor as follows:

def __init__(self):
    # note: this requires `import torch` at module level
    if torch.cuda.is_available():
        torch.backends.cudnn.benchmark = True
        torch.backends.cudnn.enabled = True
        print("CUDA optimizations enabled...")
    else:
        print("CUDA is not available")
    self.__chat_model = OllamaLLM(
        model="phi3:mini",
        temperature=0.2,
        num_gpu=5800
    )
    self.__rag_custom_prompt = PromptTemplate.from_template(self.TEMPLATE_BENEFITS)
    self.__rag_chain = None

Similarly, I am creating embeddings for FAISS in my project and found it reasonable to use all-mini:v6-l2. You might consider using it as well and check the response times with OllamaEmbeddings (see the timing sketch below); it might be very helpful for you. With these adjustments, I achieved significant improvements in my project. 😉 I hope this is helpful to you all!
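A small timing sketch for the embedding comparison mentioned above; the model name is only an example, use whatever embedding model you have pulled in Ollama:

from time import perf_counter
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="all-minilm")
texts = ["This is some text", "This is some other text"]

start = perf_counter()
vectors = embeddings.embed_documents(texts)
print(f"{len(vectors)} embeddings of dim {len(vectors[0])} in {perf_counter() - start:.2f}s")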
-
I'm playing with LangChain and Ollama. My source text is a 90-line poem (each line is at most 50 characters).
First I load it into a vector DB (Chroma):
Execution time is about 25 seconds. Why so long?! For instance, generating embeddings with SBERT takes far less time.
Then I use these vectors with an Ollama model:
Execution time is... 26 seconds. A huge amount of time for such a short text.
My hardware: Ryzen 7 5700X, 48GB RAM, GTX 1050 Ti.
I tried different settings for chunk size and separator; the differences are trivial. Is there any trick to speed it up?
It looks like GPU load is at most 50%, CPU similar, and RAM is practically not used.
Something wrong with the code?
Any suggestions appreciated,
Best
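Since the actual code isn't pasted above, here is a rough, generic sketch of the pipeline being described (poem file → Chroma with OllamaEmbeddings → question answered by an Ollama model); the file name, model names, and chunk settings are placeholders, not the original code:

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.chat_models import ChatOllama
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Split the poem into small chunks and embed them into Chroma
with open("poem.txt", encoding="utf-8") as f:
    text = f.read()
chunks = CharacterTextSplitter(separator="\n", chunk_size=200, chunk_overlap=0).split_text(text)
db = Chroma.from_texts(chunks, OllamaEmbeddings(model="llama2"))

# Ask a question using the retrieved chunks as context
prompt = ChatPromptTemplate.from_template("Answer using this context:\n{context}\n\nQuestion: {question}")
chain = (
    {"context": db.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | ChatOllama(model="llama2")
    | StrOutputParser()
)
print(chain.invoke("What is the poem about?"))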