System Info
GPU: RTX 4080 Super

Description
I am new to using LLMs, so please bear with me as I explain my current understanding of using local LLMs and what I am trying to achieve with LangChain. If anything in my understanding is incorrect or not following best practices, please feel free to let me know. My goal is to integrate the LLM with TTS and the Metahuman SDK to animate lip movements in Unreal Engine, while aiming for the lowest possible latency. I want to either interact directly with the LLM in a chat manner, or have the LLM decide whether I need to perform research, such as querying Wikipedia, based on the user's request.

My Current Understanding:
I have downloaded a Llama 3.1 8B Q4 GGUF model from Hugging Face, provided by SanctumAI. The folder contains the GGUF file and a config JSON with basic model-type information. From what I've seen, when I use standard LangChain examples without defining a custom chat template, the results aren't very good (responses contain unwanted imaginary chat participants, hallucination, bad reasoning, etc.). Since I'm inexperienced with LLM development, I'm unsure whether the chat template needs to strictly follow the format outlined in the model's metadata overview, or whether I can define my own template. For now, I've assumed I need to stick with the provided format, which looks like this:
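In rough outline (the placeholder names here are only illustrative; the same special tokens appear in the code further down):

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>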
Additionally, I tried adapting the LM Studio (Jinja-style) template, which concatenates exchanges like this:
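Approximately, a simplified sketch of that kind of loop (not the exact LM Studio template):

{{ bos_token }}{% for message in messages %}{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] + '<|eot_id|>' }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}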
Concerns:
Example Code

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder, PromptTemplate
from langchain_core.prompts.chat import HumanMessagePromptTemplate, AIMessagePromptTemplate
from langchain_community.llms import LlamaCpp
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from utility.utility import PROJ_DIR, model_weights_exist
import torch

n_ctx = 4096  # Maximum number of tokens in context
n_batch = 512
n_gpu_layers = -1
n_threads = 12

hugging_face_model_reference = "SanctumAI/Meta-Llama-3.1-8B-Instruct-GGUF"
model_path = PROJ_DIR / "models" / "LLM" / "Meta" / "SanctumAI" / "meta-llama-3.1-8b-instruct.Q4_1.gguf"
save_weights = False

machine_role, user_role = "assistant", "user"
system_prompt = "The input could be a look-up request or a question that you can answer immediately. Write at the beginning the type of request as such: Internet, Wikipedia, General Question. Followed by '|' as a separator and the look-up terms that I should use or the direct response to the question."
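# NOTE: the system prompt above asks the model to reply in the shape
# "Internet | <search terms>", "Wikipedia | <search terms>" or "General Question | <answer>",
# so the output could later be split on the first '|', e.g.:
#   route, _, payload = output.partition("|")   # illustrative only, not wired up below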
history_template_str = "<|start_header_id|>{question_role}<|end_header_id|>{{question}}<|eot_id|>\n<|start_header_id|>{machine_role}<|end_header_id|>{{answer}}<|eot_id|>".format(machine_role=machine_role, question_role=user_role)
gen_prompt_template_str = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system_prompt}<|eot_id|>
<|start_header_id|>{question_role}<|end_header_id|>{{question}}<|eot_id|><|start_header_id|>{machine_role}<|end_header_id|>""".format(system_prompt=system_prompt, machine_role=machine_role, question_role=user_role)

human_message_prompt = HumanMessagePromptTemplate(prompt=PromptTemplate(template=gen_prompt_template_str, input_variables=["question"]))
ai_message_prompt = AIMessagePromptTemplate(prompt=PromptTemplate(template=history_template_str, input_variables=["question", "answer"]))

if not model_weights_exist(model_path.parent, ext="gguf"):
    print(f"Creating model directory & downloading weights: {model_path}")
    model_path.parent.mkdir(exist_ok=True, parents=True)
    print("...")
    exit()
else:
    llm = LlamaCpp(
        model_path=str(model_path),
        n_gpu_layers=n_gpu_layers,
        n_ctx=n_ctx,
        n_batch=n_batch,
        use_mlock=True,
        use_mmap=True,
        model_kwargs={"dtype": torch.float16, "use_flash_attention": True},
        temperature=0.2,
        top_p=1,
        rope_freq_base=0,
        rope_freq_scale=0,
        verbose=False,
    )

prompt = ChatPromptTemplate.from_messages([ai_message_prompt, MessagesPlaceholder("chat_history", optional=True), human_message_prompt])
runnable = prompt | llm

store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

with_message_history = RunnableWithMessageHistory(runnable, get_session_history, input_messages_key="question", history_messages_key="chat_history", output_messages_key="answer")
session_config = {"configurable": {"session_id": "0"}}

print("Welcome to the interactive chat with LLaMA. Type 'exit' to quit.")
while True:
    user_input = input("User: ")
    if 'end conversation' in user_input.lower():
        break
    output = with_message_history.invoke({"question": user_input}, config=session_config)
    print(f"Assistant: {output}")

Error Encountered:
Replies: 2 comments 2 replies
Hey @mxmahmoud! I'm here to assist you with any bugs, questions, or contributions you have. Let's tackle this together while we wait for a human maintainer. I found some similar discussions and issues that might help you resolve the issue. Here's how you can address it:

Resolving KeyError
Ensure that your gen_prompt_template_str and history_template_str only reference the input variables you actually pass in, and that each PromptTemplate declares them:

gen_prompt_template_str = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system_prompt}<|eot_id|>
<|start_header_id|>{question_role}<|end_header_id|>{{question}}<|eot_id|><|start_header_id|>{machine_role}<|end_header_id|>"""
history_template_str = """<|start_header_id|>{question_role}<|end_header_id|>{{question}}<|eot_id|>\n<|start_header_id|>{machine_role}<|end_header_id|>{{answer}}<|eot_id|>"""

human_message_prompt = HumanMessagePromptTemplate(prompt=PromptTemplate(template=gen_prompt_template_str, input_variables=["question"]))
ai_message_prompt = AIMessagePromptTemplate(prompt=PromptTemplate(template=history_template_str, input_variables=["question", "answer"]))

Efficient History Concatenation
Use a function to concatenate chat history efficiently:

def concatenate_chat_history(chat_history):
    buffer = ""
    for dialogue_turn in chat_history:
        if isinstance(dialogue_turn, tuple):
            human = "Human: " + dialogue_turn[0]
            ai = "Assistant: " + dialogue_turn[1]
            buffer += "\n" + "\n".join([human, ai])
        else:
            raise ValueError(f"Unsupported chat history format: {type(dialogue_turn)}. Full chat history: {chat_history}")
    return buffer

GPU Optimization
Configure your LLM to utilize the GPU efficiently:

n_gpu_layers = -1  # Move all layers to GPU
n_batch = 512  # Adjust based on your GPU's VRAM

llm = LlamaCpp(
    model_path=str(model_path),
    n_gpu_layers=n_gpu_layers,
    n_ctx=n_ctx,
    n_batch=n_batch,
    use_mlock=True,
    use_mmap=True,
    model_kwargs={"dtype": torch.float16, "use_flash_attention": True},
    temperature=0.2,
    top_p=1,
    rope_freq_base=0,
    rope_freq_scale=0,
    verbose=False,
)

Example Usage
Integrate the above components into your chat application:

prompt = ChatPromptTemplate.from_messages([ai_message_prompt, MessagesPlaceholder("chat_history", optional=True), human_message_prompt])
runnable = prompt | llm

store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

with_message_history = RunnableWithMessageHistory(runnable, get_session_history, input_messages_key="question", history_messages_key="chat_history", output_messages_key="answer")
session_config = {"configurable": {"session_id": "0"}}

print("Welcome to the interactive chat with LLaMA. Type 'exit' to quit.")
while True:
    user_input = input("User: ")
    if 'end conversation' in user_input.lower():
        break
    output = with_message_history.invoke({"question": user_input}, config=session_config)
    print(f"Assistant: {output}")

By ensuring that your prompt templates declare exactly the input variables you pass at invocation, the KeyError should be resolved.
This is how I did it for now.