Multi-agent workflow using LangGraph with Llama3 via Ollama takes far longer per user query than running the agents on an individual basis. #23679
Unanswered · Ikshan-Tango asked this question in Q&A · Replies: 1 comment 1 reply
@Ikshan-Tango Hello! I'm here to assist you with any bugs, questions, or contributions you might have. To optimize the response time for subsequent queries in your multi-agent workflow using LangGraph for Llama3 with Ollama, you can consider the following strategies:

- Cache LLM responses so repeated prompts are answered without re-running the model.
- Use the `keep_alive` parameter so Ollama keeps the model loaded in memory between calls instead of reloading it for each query.

Here is an example of how to set these up:

```python
from langchain_community.llms import Ollama
from langchain_community.cache import InMemoryCache
from langchain.globals import set_llm_cache

# Initialize the cache
set_llm_cache(InMemoryCache())

# Initialize the LLM with the keep_alive parameter
ollama = Ollama(
    model="llama3",
    temperature=0,
    num_gpu=1,
    top_k=20,
    top_p=0.6,
    keep_alive="10m",  # Keep the model loaded for 10 minutes
    verbose=True,
)

# Example prompt
prompt = "Tell me a joke"

# First call, not yet in cache
response = ollama.invoke(prompt)
print(response)

# Second call, should be faster as it is cached
response = ollama.invoke(prompt)
print(response)
```

By implementing these strategies, you can optimize the response time for subsequent queries in your multi-agent workflow using LangGraph for Llama3 with Ollama [1][2][3][4].
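The effect of response caching can be illustrated with a plain-Python sketch that needs no running Ollama server. Here `slow_model_call` is a hypothetical stand-in for the real LLM call, and the dictionary plays the role of an in-memory cache keyed on the exact prompt:

```python
import time

def slow_model_call(prompt: str) -> str:
    # Hypothetical stand-in for real LLM inference; sleep mimics latency.
    time.sleep(0.2)
    return f"response to: {prompt}"

_cache: dict[str, str] = {}

def cached_call(prompt: str) -> str:
    # Serve from the cache when this exact prompt was seen before,
    # mirroring what an in-memory LLM cache does for identical prompts.
    if prompt not in _cache:
        _cache[prompt] = slow_model_call(prompt)
    return _cache[prompt]

start = time.perf_counter()
cached_call("Tell me a joke")  # first call: pays the full latency
first = time.perf_counter() - start

start = time.perf_counter()
cached_call("Tell me a joke")  # second call: served from the cache
second = time.perf_counter() - start

print(f"first={first:.3f}s second={second:.3f}s")
```

Note that a prompt cache only helps for exact-duplicate prompts; for new prompts it is `keep_alive` that avoids the cost of reloading the model.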
Checked other resources
Commit to Help
Example Code
OUTPUT :-
Description
PROBLEM STATEMENT
I'm trying to create a multi-agent workflow using LangGraph. When I run the agents individually (one script per agent), the first query takes around 15-20 seconds, but subsequent queries are answered within 3-5 seconds. This is the case for all of my agents.
Now that I've created a LangGraph workflow containing a supervisor agent that routes each query to the correct agent, every query takes more than 100 seconds to produce a response, which should not be the case. I am willing to accept a much higher response time for the first query, but for subsequent queries I need the latency to be as low as possible.
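One common cause of this latency pattern (an assumption, not confirmed from the code in this question) is that each agent node constructs its own LLM client, so every hop through the graph pays the model-load cost again. A minimal stdlib-only sketch of the alternative, sharing one client across all agents, with hypothetical agent names and a trivial routing rule standing in for the supervisor:

```python
class SharedLLM:
    # Hypothetical model client; tracks how often the expensive load runs.
    def __init__(self):
        self.loads = 0
        self._loaded = False

    def invoke(self, prompt: str) -> str:
        if not self._loaded:  # expensive model load happens only once
            self.loads += 1
            self._loaded = True
        return f"answer: {prompt}"

def make_agent(name: str, llm: SharedLLM):
    # Every agent closes over the SAME client instead of building its own.
    def agent(query: str) -> str:
        return f"[{name}] {llm.invoke(query)}"
    return agent

llm = SharedLLM()  # constructed once, reused by every node
agents = {
    "research": make_agent("research", llm),
    "coding": make_agent("coding", llm),
}

def supervisor(query: str) -> str:
    # Trivial keyword routing standing in for the supervisor agent.
    route = "coding" if "code" in query else "research"
    return agents[route](query)

print(supervisor("write some code"))
print(supervisor("find a paper"))
print("model loads:", llm.loads)
```

The design point is simply that the model-load counter stays at 1 no matter how many agents handle queries; if each `make_agent` call created its own `SharedLLM`, it would grow with every route.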
Directory structure :-
Important things to note :-
Example Agent Code
Graph Workflow :-
nodes.py file
state.py
Workflow output when executed :-
I believe my PC specs are capable enough to run the model, so it should not be a hardware issue but rather a software one.
Hardware Specs
System Info
System Information
Package Information
Packages not installed (Not Necessarily a Problem)
The following packages were not found: