
[Bug]: Race condition: Wrong trace_id sent to Langfuse when Redis caching is enabled #6783

Open · yuriykuzin opened this issue Nov 17, 2024 · 2 comments · May be fixed by #8625


yuriykuzin commented Nov 17, 2024

What happened

When using LiteLLM with Redis caching enabled and making parallel calls, incorrect trace_ids are being sent to Langfuse, despite langfuse_context.get_current_trace_id() returning the correct value. The issue appears to be a race condition that only occurs when Redis caching is enabled - the problem disappears when using in-memory cache only.

LiteLLM version: 1.52.9

Steps to Reproduce

  • Set up LiteLLM with Redis caching and Langfuse integration
  • Run multiple parallel calls using the code below
  • Observe the trace IDs in the Langfuse dashboard, or print the trace_id right before this return in litellm/integrations/langfuse/langfuse.py (a sketch follows this list):
    return {"trace_id": trace_id, "generation_id": generation_id}

Reproduction Code

import asyncio
from litellm import Router
import litellm
from langfuse.decorators import observe
import os
from langfuse.decorators import langfuse_context

# Configuration
MODEL_NAME = "your-model-name"  # Change to your deployment name
API_BASE = "https://your-endpoint.openai.azure.com"  # Insert your api base
API_VERSION = "2023-12-01-preview"
API_KEY = os.getenv("AZURE_API_KEY")
REDIS_URL = "redis://localhost:6379"

# Langfuse configuration
os.environ["LANGFUSE_HOST"] = "your-langfuse-host"
os.environ["LANGFUSE_PUBLIC_KEY"] = "your-public-key"
os.environ["LANGFUSE_SECRET_KEY"] = "your-secret-key"

# Configure LiteLLM callbacks
litellm.success_callback = ["langfuse"]
litellm.failure_callback = ["langfuse"]

# Initialize router
router = Router(
    model_list=[
        {
            "model_name": MODEL_NAME,
            "litellm_params": {
                "model": f"azure/{MODEL_NAME}",
                "api_base": API_BASE,
                "api_key": API_KEY,
                "api_version": API_VERSION,
            },
        }
    ],
    default_litellm_params={"acompletion": True},
    # Once REDIS is enabled here, langfuse integration sends the wrong
    # trace_id in parallel calls:
    redis_url=REDIS_URL,
)


async def call_llm(prompt: str):
    # Correct trace_id is printed here:
    print(
        "get_current_trace_id:",
        langfuse_context.get_current_trace_id(),
    )

    # Surprisingly, acompletion() works correctly, but we need
    # completions.create() to be fixed, since that is what our Instructor
    # integration relies on.
    # response = await router.acompletion(

    response = await router.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
        metadata={
            "trace_id": langfuse_context.get_current_trace_id(),
            "generation_name": prompt,
            "debug_langfuse": True,
        },
    )
    return response


@observe()
async def process():
    # First call with Request1
    await call_llm("Tell me the result of 2+2")

    # Second call with Request2
    await call_llm("Do you like Math, yes or no?")


async def main():
    # Run two process functions in parallel
    await asyncio.gather(process(), process())


if __name__ == "__main__":
    asyncio.run(main())

Current Behavior

When Redis caching is enabled and parallel calls are made:

  • langfuse_context.get_current_trace_id() returns the correct trace_id.
  • However, the wrong trace_id is sent to Langfuse.
  • This can be verified by adding a print statement before line 296 in litellm/integrations/langfuse/langfuse.py (as sketched under Steps to Reproduce), which produces output like the following:

get_current_trace_id: c45394a2-4fa0-4599-aa3c-88a101b35868
get_current_trace_id: fcb74aee-2de0-465e-b1f3-afd4730fe193
Real sent trace_id: fcb74aee-2de0-465e-b1f3-afd4730fe193
get_current_trace_id: c45394a2-4fa0-4599-aa3c-88a101b35868
Real sent trace_id: fcb74aee-2de0-465e-b1f3-afd4730fe193
get_current_trace_id: fcb74aee-2de0-465e-b1f3-afd4730fe193
Real sent trace_id: c45394a2-4fa0-4599-aa3c-88a101b35868
Real sent trace_id: fcb74aee-2de0-465e-b1f3-afd4730fe193

Here c45394a2-4fa0-4599-aa3c-88a101b35868 should have been sent twice but was in fact sent only once, while fcb74aee-2de0-465e-b1f3-afd4730fe193 should have been sent twice but was sent three times.
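
To cross-check independently which trace_id actually reaches LiteLLM's logging layer, a custom callback can print what it receives, to be compared against the value placed into metadata. This is a minimal sketch, not part of the original report, and it assumes the request metadata is exposed to callbacks under kwargs["litellm_params"]["metadata"]:

import litellm
from litellm.integrations.custom_logger import CustomLogger


class TraceIdChecker(CustomLogger):
    # Called by LiteLLM after each successful async request.
    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Assumption: the request metadata travels in kwargs["litellm_params"]["metadata"].
        metadata = (kwargs.get("litellm_params") or {}).get("metadata") or {}
        print("callback saw trace_id:", metadata.get("trace_id"))


# Register alongside the existing "langfuse" success callback.
litellm.callbacks = [TraceIdChecker()]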

Expected Behavior

The trace_id sent to Langfuse should match the one returned by langfuse_context.get_current_trace_id().
Trace IDs should remain consistent regardless of whether Redis caching is enabled.

In this example here's what is being sent when Redis is disabled:

get_current_trace_id: 3a0d9972-9730-465e-9a63-840e9c8f8fd3
get_current_trace_id: 94e7c707-0bd7-47a9-8e25-bc8f8eca2b6d
Real sent trace_id: 94e7c707-0bd7-47a9-8e25-bc8f8eca2b6d
get_current_trace_id: 94e7c707-0bd7-47a9-8e25-bc8f8eca2b6d
Real sent trace_id: 3a0d9972-9730-465e-9a63-840e9c8f8fd3
get_current_trace_id: 3a0d9972-9730-465e-9a63-840e9c8f8fd3
Real sent trace_id: 94e7c707-0bd7-47a9-8e25-bc8f8eca2b6d
Real sent trace_id: 3a0d9972-9730-465e-9a63-840e9c8f8fd3

Each trace_id was sent exactly twice, as expected.

Additional Notes

  • The issue only occurs when Redis caching is enabled.
  • The problem disappears when using in-memory cache only.
  • Interestingly, router.acompletion() works correctly, while router.chat.completions.create() exhibits the issue. This affects integrations that specifically need to use completions.create(), such as Instructor (a minimal workaround sketch follows this list).
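
Based on that observation, a possible interim workaround for callers that can use acompletion() directly (which does not help the Instructor path) is to bypass chat.completions.create(). This is only a sketch, reusing the names (router, MODEL_NAME, langfuse_context) from the reproduction code above, and it is only as reliable as the behavior observed there:

async def call_llm_workaround(prompt: str):
    # Same call as in the reproduction code, but via acompletion(),
    # which did not show the wrong-trace_id behavior in these tests.
    return await router.acompletion(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
        metadata={
            "trace_id": langfuse_context.get_current_trace_id(),
            "generation_name": prompt,
        },
    )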

Possible Investigation Points

  • Race condition in how trace IDs are handled when Redis caching is enabled (a generic illustration of this failure mode follows below).
  • Difference in trace ID handling between acompletion() and completions.create().
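
To make the first point concrete, here is a generic, self-contained illustration of the suspected failure mode. It is not LiteLLM's actual code path; it only shows how a metadata object shared between concurrently running requests lets the last writer's trace_id win:

import asyncio

shared_metadata = {}  # stands in for state shared across parallel requests


async def fake_request(trace_id: str):
    shared_metadata["trace_id"] = trace_id   # each request writes its own id
    await asyncio.sleep(0)                   # yield, letting the other request overwrite it
    print("logged trace_id:", shared_metadata["trace_id"])  # may print the other request's id


async def main():
    await asyncio.gather(fake_request("trace-A"), fake_request("trace-B"))


asyncio.run(main())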

Files to Look At

  • litellm/integrations/langfuse/langfuse.py (specifically around line 296)

Let me know if you need any additional information or clarification.

Relevant log output

No response

Twitter / LinkedIn details

No response

yuriykuzin added the bug (Something isn't working) label on Nov 17, 2024
@yuriykuzin (Author) commented:

Actually, it goes further: the whole Langfuse report for parallel calls is sometimes wrong when Redis caching is enabled.

@ishaan-jaff (Contributor) commented:

Is this still happening on latest? Can we get help with more details / how to repro, @yuriykuzin?
