System Info
GPU: RTX 4080 Super

Description
I am new to using LLMs, so please bear with me as I explain my current understanding of using local LLMs and what I am trying to achieve with LangChain. If anything in my understanding is incorrect or not following best practices, please feel free to let me know. My goal is to integrate the LLM with TTS and the Metahuman SDK to animate lip movements in Unreal Engine, while aiming for the lowest possible latency. I want to either interact directly with the LLM in a chat manner, or have the LLM decide whether I need to perform research, such as querying Wikipedia, based on the user's request.

My Current Understanding:
I have downloaded a Llama 3.1 8B Q4 GGUF model from Hugging Face, provided by SanctumAI. The folder contains the GGUF file and a config JSON with basic model-type information. From what I've seen, when I use standard LangChain examples without defining a custom chat template, the results aren't very good (responses contain unwanted imaginary chat participants, hallucination, bad reasoning, etc.). Since I'm inexperienced with LLM development, I'm unsure whether the chat template needs to strictly follow the format outlined in the model's metadata overview, or whether I can define my own template. For now, I've assumed I need to stick with the provided format, which looks like this:
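In rough outline (the placeholder names here are only illustrative; the same special tokens appear in the code further down):

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>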
Additionally, I tried adapting the LM Studio (Jinja-style) template, which concatenates exchanges like this:
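Approximately, a simplified sketch of that kind of loop (not the exact LM Studio template):

{{ bos_token }}{% for message in messages %}{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] + '<|eot_id|>' }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}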
Concerns:
Example Code

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder, PromptTemplate
from langchain_core.prompts.chat import HumanMessagePromptTemplate, AIMessagePromptTemplate
from langchain_community.llms import LlamaCpp
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from utility.utility import PROJ_DIR, model_weights_exist
import torch

n_ctx = 4096  # Maximum number of tokens in context
n_batch = 512
n_gpu_layers = -1
n_threads = 12

hugging_face_model_reference = "SanctumAI/Meta-Llama-3.1-8B-Instruct-GGUF"
model_path = PROJ_DIR / "models" / "LLM" / "Meta" / "SanctumAI" / "meta-llama-3.1-8b-instruct.Q4_1.gguf"
save_weights = False

machine_role, user_role = "assistant", "user"
system_prompt = "The input could be a look-up request or a question that you can answer immediately. Write at the beginning the type of request as such: Internet, Wikipedia, General Question. Followed by '|' as a separator and the look-up terms that I should use or the direct response to the question."
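# NOTE: the system prompt above asks the model to reply in the shape
# "Internet | <search terms>", "Wikipedia | <search terms>" or "General Question | <answer>",
# so the output could later be split on the first '|', e.g.:
#   route, _, payload = output.partition("|")   # illustrative only, not wired up below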
history_template_str = "<|start_header_id|>{question_role}<|end_header_id|>{{question}}<|eot_id|>\n<|start_header_id|>{machine_role}<|end_header_id|>{{answer}}<|eot_id|>".format(machine_role=machine_role, question_role=user_role)
gen_prompt_template_str = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system_prompt}<|eot_id|>
<|start_header_id|>{question_role}<|end_header_id|>{{question}}<|eot_id|><|start_header_id|>{machine_role}<|end_header_id|>""".format(system_prompt=system_prompt, machine_role=machine_role, question_role=user_role)

human_message_prompt = HumanMessagePromptTemplate(prompt=PromptTemplate(template=gen_prompt_template_str, input_variables=["question"]))
ai_message_prompt = AIMessagePromptTemplate(prompt=PromptTemplate(template=history_template_str, input_variables=["question", "answer"]))

if not model_weights_exist(model_path.parent, ext="gguf"):
    print(f"Creating model directory & downloading weights: {model_path}")
    model_path.parent.mkdir(exist_ok=True, parents=True)
    print("...")
    exit()
else:
    llm = LlamaCpp(
        model_path=str(model_path),
        n_gpu_layers=n_gpu_layers,
        n_ctx=n_ctx,
        n_batch=n_batch,
        use_mlock=True,
        use_mmap=True,
        model_kwargs={"dtype": torch.float16, "use_flash_attention": True},
        temperature=0.2,
        top_p=1,
        rope_freq_base=0,
        rope_freq_scale=0,
        verbose=False,
    )

prompt = ChatPromptTemplate.from_messages([ai_message_prompt, MessagesPlaceholder("chat_history", optional=True), human_message_prompt])
runnable = prompt | llm

store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

with_message_history = RunnableWithMessageHistory(runnable, get_session_history, input_messages_key="question", history_messages_key="chat_history", output_messages_key="answer")
session_config = {"configurable": {"session_id": "0"}}

print("Welcome to the interactive chat with LLaMA. Type 'exit' to quit.")
while True:
    user_input = input("User: ")
    if 'end conversation' in user_input.lower():
        break
    output = with_message_history.invoke({"question": user_input}, config=session_config)
    print(f"Assistant: {output}")

Error Encountered:
Replies: 2 comments 2 replies
Hey @mxmahmoud! I'm here to assist you with any bugs, questions, or contributions you have. Let's tackle this together while we wait for a human maintainer. I found some similar discussions and issues that might help you resolve the issue. Here's how you can address it:

Resolving KeyError
Ensure that your gen_prompt_template_str and history_template_str only reference the input variables you actually pass in, and that each PromptTemplate declares them:

gen_prompt_template_str = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system_prompt}<|eot_id|>
<|start_header_id|>{question_role}<|end_header_id|>{{question}}<|eot_id|><|start_header_id|>{machine_role}<|end_header_id|>"""
history_template_str = """<|start_header_id|>{question_role}<|end_header_id|>{{question}}<|eot_id|>\n<|start_header_id|>{machine_role}<|end_header_id|>{{answer}}<|eot_id|>"""

human_message_prompt = HumanMessagePromptTemplate(prompt=PromptTemplate(template=gen_prompt_template_str, input_variables=["question"]))
ai_message_prompt = AIMessagePromptTemplate(prompt=PromptTemplate(template=history_template_str, input_variables=["question", "answer"]))

Efficient History Concatenation
Use a function to concatenate chat history efficiently:

def concatenate_chat_history(chat_history):
    buffer = ""
    for dialogue_turn in chat_history:
        if isinstance(dialogue_turn, tuple):
            human = "Human: " + dialogue_turn[0]
            ai = "Assistant: " + dialogue_turn[1]
            buffer += "\n" + "\n".join([human, ai])
        else:
            raise ValueError(f"Unsupported chat history format: {type(dialogue_turn)}. Full chat history: {chat_history}")
    return buffer

GPU Optimization
Configure your LLM to utilize the GPU efficiently:

n_gpu_layers = -1  # Move all layers to GPU
n_batch = 512  # Adjust based on your GPU's VRAM

llm = LlamaCpp(
    model_path=str(model_path),
    n_gpu_layers=n_gpu_layers,
    n_ctx=n_ctx,
    n_batch=n_batch,
    use_mlock=True,
    use_mmap=True,
    model_kwargs={"dtype": torch.float16, "use_flash_attention": True},
    temperature=0.2,
    top_p=1,
    rope_freq_base=0,
    rope_freq_scale=0,
    verbose=False,
)

Example Usage
Integrate the above components into your chat application:

prompt = ChatPromptTemplate.from_messages([ai_message_prompt, MessagesPlaceholder("chat_history", optional=True), human_message_prompt])
runnable = prompt | llm

store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

with_message_history = RunnableWithMessageHistory(runnable, get_session_history, input_messages_key="question", history_messages_key="chat_history", output_messages_key="answer")
session_config = {"configurable": {"session_id": "0"}}

print("Welcome to the interactive chat with LLaMA. Type 'exit' to quit.")
while True:
    user_input = input("User: ")
    if 'end conversation' in user_input.lower():
        break
    output = with_message_history.invoke({"question": user_input}, config=session_config)
    print(f"Assistant: {output}")

By ensuring that your prompt templates declare exactly the input variables you pass at invocation, the KeyError should be resolved.
This is how I did it for now.