Replies: 10 comments 3 replies
-
Same here. The snippet below takes a whopping 2s to run:

from langchain_community.embeddings import OllamaEmbeddings
from langchain.docstore.document import Document

embeddings = OllamaEmbeddings(base_url="http://localhost:11434", model="llama2", show_progress=True)
some_doc = Document(page_content="This is some text")
print(embeddings.embed_documents([some_doc.page_content]))
-
Same here. Does anyone know how to solve this?
-
Same here. I'm trying to write a simple chat application with agents, and each query takes ~350s to run.
The model responds almost immediately, but when I use the Python package for Ollama via from langchain_community.llms.ollama import Ollama the performance takes a BIG hit; tbh I have no idea what's causing this.
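One rough way to narrow it down is to time the same prompt through the raw ollama Python package and through the LangChain wrapper; this is just a sketch, and the model name is only an example:

import time
import ollama
from langchain_community.llms.ollama import Ollama

prompt = "Why is the sky blue?"

# Direct call through the ollama Python package
start = time.perf_counter()
ollama.generate(model="llama2", prompt=prompt)
print(f"ollama package: {time.perf_counter() - start:.1f}s")

# Same prompt through the LangChain wrapper
start = time.perf_counter()
Ollama(model="llama2").invoke(prompt)
print(f"LangChain wrapper: {time.perf_counter() - start:.1f}s")

# Note: the first call may include model load time; run each twice for a fairer comparison.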
-
Same problem here.

from langchain_community.document_loaders import CSVLoader
from langchain.docstore.document import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import re
loader = CSVLoader(
"file.csv",
encoding="utf-8",
csv_args={
"delimiter": ",",
"quotechar": '"',
"fieldnames": [""],
}
)
documents = loader.load()
docs = [Document(page_content=re.sub("[\\s]+", " ", re.sub("[\\n\\r]", " ", document.page_content[2:]))) for document in documents]
modelPath = "sentence-transformers/all-MiniLM-l6-v2"
model_kwargs = {'device':'cuda'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
model_name=modelPath,
cache_folder="embeddings",
model_kwargs=model_kwargs,
encode_kwargs=encode_kwargs,
show_progress=True
)
db = FAISS.load_local("faiss_data", embeddings, allow_dangerous_deserialization=True)
retriever = db.as_retriever(search_kwargs={"k": 4}, search_type="similarity")
template = """
[INST]
Use the following context elements for answering the final question.
If you don't know the answer, say that you don't know it, don't try to guess the response.
{context}
Question: {question}
Answer:
[/INST]"""
custom_rag_prompt = ChatPromptTemplate.from_template(template)
model = "mistral:7b"
llm = ChatOllama(model=model, num_predict=512, num_ctx=3072, temperature=0.2, num_gpu=1)
rag_chain = (
{"context": retriever , "question": RunnablePassthrough()}
| custom_rag_prompt
| llm
| StrOutputParser()
)
rag_chain.invoke("Who left the company in the last week?")

The rag_chain took 1.30 minutes to answer the question, and even without context it takes at least 1 minute. With ollama run mistral:7b it answers in at most 5 seconds, so I'm thinking that LangChain is causing a bottleneck between the Python code and the Ollama backend (a quick way to check this is sketched below).
Thank you very much!
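To see where the time actually goes, one option is to time the pieces of the chain separately, reusing the retriever and llm defined above; a rough sketch:

import time

question = "Who left the company in the last week?"

# Time the retriever on its own
start = time.perf_counter()
docs = retriever.invoke(question)
print(f"retriever: {time.perf_counter() - start:.1f}s for {len(docs)} docs")

# Time the chat model on its own, without any retrieved context
start = time.perf_counter()
llm.invoke(question)
print(f"llm: {time.perf_counter() - start:.1f}s")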
-
Langchain Ollama use pure
-
For those interested: in https://github.com/MaxiBoether/langchain-ollama-package I have prototyped using the ollama package for e.g. the chat model as well, which massively speeds up inference. I need to clean this up a bit (tbh, Claude Opus did most of the work :D) and might submit a PR later on.
-
Sorry for the late response, I was busy with daily work. Yes, technically the current implementation is very poor. It looks like @MaxiBoether has done some interesting work.
-
The same issue stands for LangChain JS. I have tried several models; all of them answer almost instantly in the Ollama CLI, but LangChain takes at least a minute to deliver the answer for the same question I asked in the CLI. Please suggest if there is any fix available.
-
You can use streaming so you don't have to wait for the complete response to be generated before you see any output.
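For example, a minimal streaming sketch with ChatOllama (the model name is just a placeholder) prints tokens as they arrive instead of waiting for the whole answer:

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="mistral:7b")
# Each chunk is a message chunk; print its text as soon as it arrives
for chunk in llm.stream("Explain streaming in one paragraph."):
    print(chunk.content, end="", flush=True)
print()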
-
Discussing the Issue
Hi, nice to meet you! My name is Diego Cancino, and I faced a similar problem. Let me give some context: when I used OllamaLLM from langchain_ollama, I ran into the same issue of excessive slowness. The hardware I was using included an NVIDIA RTX 3070 graphics card, so I knew GPU performance wasn't the bottleneck. Frustration drove me to dig deeper into the issue. When you use Ollama with a freshly downloaded model via the ollama pull [model_name] command, the server picks configurations to run the model efficiently. However, using OllamaLLM introduces configurations on the Ollama server that limit the response speed. You can imagine my confusion when running my Python script and seeing responses much slower than when using the Ollama container directly. For reference, I ran the Ollama container with the following command:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Eventually, I discovered that there is a parameter in OllamaLLM (and in related classes like ChatOllama) called num_gpu. While some might interpret this as the total number of GPUs to use, desperation led me to experiment. Since an RTX 3070 has 5,888 CUDA cores, I hypothesized that setting num_gpu to a value close to that number might help. After enabling CUDA optimizations with PyTorch and setting num_gpu to a value close to 5,888, I observed significantly faster responses compared to leaving it at 1. For this reason, I suggest trying this configuration.
Problematic Code
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from arq_ai.faiss_index import FAISSIndexer
from uuid import UUID
from typing import AsyncGenerator
class FastAssistantChat:
    TEMPLATE_BENEFITS = """
    You are an assistant that provides concise benefits by summarizing context to add value to the salesperson during the sales process.
    The output should be structured as follows:
    Product Title
    Benefits
    - benefit 1
    - benefit 2
    If no context is provided, say: "I don't have product information" and do not attempt to create new information.
    {context}
    """

    def __init__(self):
        self.__chat_model = OllamaLLM(
            model="phi3:mini",
            temperature=0.2
        )
        self.__rag_custom_prompt = PromptTemplate.from_template(self.TEMPLATE_BENEFITS)
        self.__rag_chain = None

    def generate_chain(self):
        self.__rag_chain = (
            {"context": RunnablePassthrough()}
            | self.__rag_custom_prompt
            | self.__chat_model
        )

    async def get_product_benefits(self, id_product: UUID, indexer: FAISSIndexer) -> str:
        if not self.__rag_chain:
            raise RuntimeError("The RAG chain isn't initialized")
        context = await indexer.query_benefits_sale(id_product)
        if not context:
            return "No product information found"
        result = await self.__rag_chain.ainvoke({'context': context})
        return result

    async def get_benefits_only_product_stream(self, id_product: UUID, indexer: FAISSIndexer) -> AsyncGenerator[str, None]:
        if not self.__rag_chain:
            raise RuntimeError("The RAG chain isn't initialized")
        context = await indexer.query_benefits_sale(id_product)
        if not context:
            yield "No product information found"
            return
        async for chunk in self.__rag_chain.astream({'context': context}):
            print(chunk, end="", flush=True)
            yield chunk

Refactored Constructor
I made some changes to the class constructor as follows:

def __init__(self):
    # note: this requires `import torch` at module level
    if torch.cuda.is_available():
        torch.backends.cudnn.benchmark = True
        torch.backends.cudnn.enabled = True
        print("CUDA optimizations enabled...")
    else:
        print("CUDA is not available")
    self.__chat_model = OllamaLLM(
        model="phi3:mini",
        temperature=0.2,
        num_gpu=5800
    )
    self.__rag_custom_prompt = PromptTemplate.from_template(self.TEMPLATE_BENEFITS)
    self.__rag_chain = None

Similarly, I am creating embeddings for FAISS in my project and found it reasonable to use all-mini:v6-l2. You might consider using it as well and check the response times with OllamaEmbeddings (see the timing sketch below); it might be very helpful for you. With these adjustments, I achieved significant improvements in my project. 😉 I hope this is helpful to you all!
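A small timing sketch for the embedding comparison mentioned above; the model name is only an example, use whatever embedding model you have pulled in Ollama:

from time import perf_counter
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="all-minilm")
texts = ["This is some text", "This is some other text"]

start = perf_counter()
vectors = embeddings.embed_documents(texts)
print(f"{len(vectors)} embeddings of dim {len(vectors[0])} in {perf_counter() - start:.2f}s")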
-
I'm playing with LangChain and Ollama. My source text is a 90-line poem (each line is at most 50 characters).
First I load it into a vector DB (Chroma):
Execution time is about 25 seconds. Why so long?! For instance, generating embeddings with SBERT takes far less time.
Then I use these vectors with an Ollama model:
Execution time is... 26 seconds. A huge amount of time for such a short text.
My hardware: Ryzen 7 5700X, 48GB RAM, GTX 1050 Ti.
I tried different settings for chunk size and separator; the differences are trivial. Is there any trick to speed it up?
It looks like GPU load is at most 50%, CPU similar, and RAM is practically not used.
Something wrong with the code?
Any suggestions appreciated,
Best
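Since the actual code isn't pasted above, here is a rough, generic sketch of the pipeline being described (poem file → Chroma with OllamaEmbeddings → question answered by an Ollama model); the file name, model names, and chunk settings are placeholders, not the original code:

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.chat_models import ChatOllama
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Split the poem into small chunks and embed them into Chroma
with open("poem.txt", encoding="utf-8") as f:
    text = f.read()
chunks = CharacterTextSplitter(separator="\n", chunk_size=200, chunk_overlap=0).split_text(text)
db = Chroma.from_texts(chunks, OllamaEmbeddings(model="llama2"))

# Ask a question using the retrieved chunks as context
prompt = ChatPromptTemplate.from_template("Answer using this context:\n{context}\n\nQuestion: {question}")
chain = (
    {"context": db.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | ChatOllama(model="llama2")
    | StrOutputParser()
)
print(chain.invoke("What is the poem about?"))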