Using Claude prompt caching with ChatPromptTemplate #25610
Replies: 2 comments 10 replies
-
Hey there, @felixwoestmann! I'm here to help you with your bug, answer any questions you have, and guide you on becoming a contributor. Let's dive in and work on resolving your issue together.

To implement Claude's prompt caching with ChatPromptTemplate, you can set the cache-control property in the additional_kwargs of the message. Here's how you can set it up:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage
prompt = ChatPromptTemplate.from_messages([
    SystemMessage(content=cached_prompt, additional_kwargs={"cache-control": {"type": "ephemeral"}}),
    HumanMessage(content=f"Here is the comment to analyze: {comment}")
])
```

This ensures that the cache-control property is set on the cached system message. Additionally, make sure that the caching mechanism is properly configured within the _generate_with_cache method:

```python
def _generate_with_cache(
    self,
    messages: List[BaseMessage],
    stop: Optional[List[str]] = None,
    run_manager: Optional[CallbackManagerForLLMRun] = None,
    **kwargs: Any,
) -> ChatResult:
    # Resolve the cache to use: an instance set on the model, or the global one.
    if isinstance(self.cache, BaseCache):
        llm_cache = self.cache
    else:
        llm_cache = get_llm_cache()
    check_cache = self.cache or self.cache is None
    if check_cache:
        if llm_cache:
            llm_string = self._get_llm_string(stop=stop, **kwargs)
            prompt = dumps(messages)
            cache_val = llm_cache.lookup(prompt, llm_string)
            # Return the cached generations if there was a hit.
            if isinstance(cache_val, list):
                return ChatResult(generations=cache_val)
        elif self.cache is None:
            pass
        else:
            raise ValueError(
                "Asked to cache, but no cache found at `langchain.cache`."
            )
    result = self._generate(messages, stop=stop, run_manager=run_manager, **kwargs)
    for idx, generation in enumerate(result.generations):
        if run_manager and generation.message.id is None:
            generation.message.id = f"run-{run_manager.run_id}-{idx}"
        generation.message.response_metadata = _gen_info_and_msg_metadata(generation)
    if len(result.generations) == 1 and result.llm_output is not None:
        result.generations[0].message.response_metadata = {
            **result.llm_output,
            **result.generations[0].message.response_metadata,
        }
    # Store the fresh generations so subsequent identical calls hit the cache.
    if check_cache and llm_cache:
        llm_cache.update(prompt, llm_string, result.generations)
    return result
```

This code ensures that the caching mechanism checks the cache for existing results and updates the cache with new results as needed [1]. By setting the cache-control property on the message and keeping this caching path configured, Claude's prompt caching should work with your ChatPromptTemplate.
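As a rough end-to-end sketch, the prompt above can be piped into a ChatAnthropic model with the prompt-caching beta enabled. The extra_headers argument and the beta header value here are assumptions based on Anthropic's prompt-caching beta and may differ across langchain-anthropic versions:

```python
from langchain_anthropic import ChatAnthropic

# Assumed header name for the prompt-caching beta; verify it against the
# Anthropic documentation for your SDK version.
llm = ChatAnthropic(
    model="claude-3-5-sonnet-20240620",
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)

# `prompt` is the ChatPromptTemplate built above; it has no input variables,
# so it can be invoked with an empty mapping.
chain = prompt | llm
response = chain.invoke({})
```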
-
Solved it by doing this:
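A minimal sketch of the content-block approach commonly used for this, assuming langchain-anthropic forwards Anthropic-style cache_control blocks (cached_prompt and comment are placeholder values):

```python
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate

cached_prompt = "..."  # the long system prompt you want Anthropic to cache
comment = "..."        # the user input for this particular request

prompt = ChatPromptTemplate.from_messages([
    # Attach cache_control to a content block instead of additional_kwargs.
    SystemMessage(
        content=[
            {
                "type": "text",
                "text": cached_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ]
    ),
    HumanMessage(content=f"Here is the comment to analyze: {comment}"),
])
```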
-
Checked other resources
Commit to Help
Example Code
Description
Hello,
I am currently trying to implement Claude's prompt caching in my existing application.
Setting the extra header to enable this beta works well, and the API returns an overview of how many tokens were written to and read from the cache. However, these are always 0, because I am setting the cache-control property of the message incorrectly. On the X.com account of LangChain I found an example that sets the cache-control property when providing messages as raw JSON. However, I am using ChatPromptTemplates and was wondering how I would do it there; currently I am trying something like this:
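Regarding the token overview mentioned above: those counters can be read off the returned message. A hedged sketch, assuming llm is a ChatAnthropic instance with the beta enabled and prompt is the template in question; the exact metadata keys come from Anthropic's beta response and may vary by langchain-anthropic version:

```python
# Hypothetical: `llm` and `prompt` are defined as described above.
response = (prompt | llm).invoke({})

# Anthropic's raw usage block is typically mirrored into response_metadata;
# non-zero values here indicate that cache-control was applied correctly.
usage = response.response_metadata.get("usage", {})
print(usage.get("cache_creation_input_tokens"))
print(usage.get("cache_read_input_tokens"))
```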
System Info
System Information
Package Information
Optional packages not installed
Other Dependencies