
HuggingFaceTextGenInference Streaming does not output #15516

Closed
PeterTucker opened this issue Jan 4, 2024 · 8 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: models Related to LLMs or chat model modules

Comments

PeterTucker commented Jan 4, 2024

System Info

langchain==0.0.354
text_generation==0.6.1

python:3.10-slim

Who can help?

@agola11 @hwaking

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

Model: TheBloke/Llama-2-7B-Chat-GPTQ, but I've also tried TheBloke/Mistral-7B-OpenOrca-GPTQ
FastAPI example with HuggingFaceTextGenInference streaming:

from fastapi import FastAPI

import langchain
from langchain.llms import HuggingFaceTextGenInference
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

import os
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
langchain.debug = True

# Enable CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # You can specify the list of allowed origins or use "*" for any origin
    allow_credentials=True,
    allow_methods=["*"],  # You can specify the HTTP methods that are allowed
    allow_headers=["*"],  # You can specify the HTTP headers that are allowed
)

# Configuration for local LLM
ai_url = "http://tgi-ai-server:" + str(os.getenv("AI_PORT", 80)) + "/generate"

# Configure the LLM
llm = HuggingFaceTextGenInference(
    inference_server_url=ai_url,
    max_new_tokens=20,
    streaming=True,
)

template = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]
"""
prompt_template = PromptTemplate(
    template=template, 
    input_variables= ["prompt"]
)

# Initialize the LLM Chain
llm_chain = LLMChain(llm=llm, prompt=prompt_template)

@app.get("/chat")
async def chat():

    prompt = {"prompt":"What is the Nickelodeon channel?"}
    # Generate the response using the LLM Chain and stream the output
    async def generate():
        for text in llm_chain.run(prompt):
            yield text
            

    return StreamingResponse(generate(), media_type="text/plain")

# Run the server (if running this script directly)
# Use the command: uvicorn script_name:app --reload
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Output:

[chain/start] [1:chain:LLMChain] Entering Chain run with input:
{
  "prompt": "What is the Nickelodeon channel?"
}
[llm/start] [1:chain:LLMChain > 2:llm:HuggingFaceTextGenInference] Entering LLM run with input:
{
  "prompts": [
    "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\nWhat is the Nickelodeon channel?[/INST]"
  ]
}
[llm/end] [1:chain:LLMChain > 2:llm:HuggingFaceTextGenInference] [765ms] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "",
        "generation_info": null,
        "type": "Generation"
      }
    ]
  ],
  "llm_output": null,
  "run": null
}
[chain/end] [1:chain:LLMChain] [765ms] Exiting Chain run with output:
{
  "text": ""
}

Expected behavior

HuggingFaceTextGenInference should stream tokens back, but it does not return any streaming data. It works fine when streaming=False is set in the parameters.
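
To isolate whether the LLM itself streams (independent of FastAPI and LLMChain), a minimal check along these lines can help. This is only a sketch: it assumes the installed langchain exposes the Runnable stream() method on LLM objects and reuses the placeholder host, port, and /generate endpoint from the reproduction above.

import os
from langchain.llms import HuggingFaceTextGenInference

# Same placeholder host/port (and /generate endpoint) as in the reproduction.
ai_url = "http://tgi-ai-server:" + str(os.getenv("AI_PORT", 80)) + "/generate"

llm = HuggingFaceTextGenInference(
    inference_server_url=ai_url,
    max_new_tokens=20,
    streaming=True,
)

# Print chunks as they arrive; if nothing prints, the problem is between the
# LLM wrapper and TGI rather than in FastAPI's StreamingResponse.
for chunk in llm.stream("What is the Nickelodeon channel?"):
    print(chunk, end="", flush=True)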

dosubot bot commented Jan 4, 2024

🤖

Hello @PeterTucker,

I'm Dosu, a friendly bot here to assist you with LangChain related issues, answer questions, and guide you in becoming a contributor. While we're waiting for a human maintainer, I'm here to help you get a head start on your issue. Let's dive in, shall we?

Based on the information you've provided, it seems like the issue might be related to the handling of special tokens or stop sequences in the HuggingFaceTextGenInference class of the LangChain framework.

In the _stream and _astream methods of this class, a token is only yielded if it is not special and doesn't contain a stop sequence. If the token is special or contains a stop sequence, the text is set to None and thus not yielded. This could be the reason why you're not seeing any streaming data.

Here is the relevant part of the code:

# identify text to yield
text: Optional[str] = None
if res.token.special:
    text = None
elif stop_seq_found:
    text = res.token.text[: res.token.text.index(stop_seq_found)]
else:
    text = res.token.text

# yield text, if any
if text:
    chunk = GenerationChunk(text=text)
    yield chunk
    if run_manager:
        run_manager.on_llm_new_token(chunk.text)

# break if stop sequence found
if stop_seq_found:
    break

This code is part of the _stream and _astream methods in the HuggingFaceTextGenInference class. You can find the full code in the libs/langchain/langchain/llms/huggingface_text_gen_inference.py file.

If the tokens being generated are special or contain a stop sequence, you might want to adjust your parameters or the text you're feeding into the model. If this doesn't solve your issue, could you please provide more information about the specific conditions under which this issue occurs? For example, does this happen for all prompts or only specific ones? Any additional information would be helpful in diagnosing the issue.
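
A quick way to see what TGI is actually sending is to read the stream with the text_generation client directly and print each token's special flag. This is a sketch only; the host and port are the placeholders from the reproduction, so adjust as needed.

from text_generation import Client

# Placeholder TGI server address taken from the reproduction; adjust as needed.
client = Client("http://tgi-ai-server:80")

for response in client.generate_stream("What is Deep Learning?", max_new_tokens=17):
    # Each streamed token reports whether TGI marked it as "special".
    print(repr(response.token.text), "special =", response.token.special)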


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

PeterTucker (Author)

@dosu-bot
Also, streaming via curl from the command line works fine:

curl 127.0.0.1:3331/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
    -H 'Content-Type: application/json'

Output:

data:{"token":{"id":13,"text":"\n","logprob":-0.3269043,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":13,"text":"\n","logprob":-2.6054688,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":2772,"text":"De","logprob":-0.017089844,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":1022,"text":"ep","logprob":-0.000064492226,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":6509,"text":" learning","logprob":-0.004310608,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":338,"text":" is","logprob":-0.15283203,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":263,"text":" a","logprob":-0.0038261414,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":11306,"text":" subset","logprob":-0.11029053,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":310,"text":" of","logprob":-0.000018119812,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":4933,"text":" machine","logprob":-0.0003349781,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":6509,"text":" learning","logprob":-0.0001887083,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":393,"text":" that","logprob":-0.0065727234,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":20789,"text":" involves","logprob":-0.004333496,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":278,"text":" the","logprob":-0.0018758774,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":671,"text":" use","logprob":-0.00043320656,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":310,"text":" of","logprob":-0.0000046491623,"special":false},"generated_text":null,"details":null}

data:{"token":{"id":23116,"text":" artificial","logprob":-0.14477539,"special":false},"generated_text":"\n\nDeep learning is a subset of machine learning that involves the use of artificial","details":null}

dosubot bot commented Jan 4, 2024

🤖

Based on the information you've provided, it seems like you're correctly using the streaming=True parameter with the HuggingFaceTextGenInference class and the LLMChain class. However, the issue might be with how you're using the StreamingResponse in FastAPI.

In your current implementation, you're using a synchronous generator function generate() with the StreamingResponse. However, FastAPI's StreamingResponse expects an asynchronous generator when running in an async context. This might be why you're not seeing any output when you run your FastAPI application.

Here's how you can modify your generate() function to be an asynchronous generator:

@app.get("/chat")
async def chat():

    prompt = {"prompt":"What is the Nickelodeon channel?"}
    # Generate the response using the LLM Chain and stream the output
    async def generate():
        async for text in llm_chain.run(prompt):
            yield text
            

    return StreamingResponse(generate(), media_type="text/plain")

Please note that the run method of the LLMChain class should also be an asynchronous generator when the streaming=True parameter is set in the HuggingFaceTextGenInference class. If it's not, you might need to modify the run method or create a new method that is an asynchronous generator.
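
Alternatively, a genuine asynchronous token stream can be obtained without touching LLMChain by streaming from the LLM directly. This is only a sketch: it assumes the installed langchain exposes astream() on LLM objects and reuses the llm, prompt_template, and FastAPI app objects from the reproduction.

@app.get("/chat")
async def chat():
    question = "What is the Nickelodeon channel?"

    async def generate():
        # Format the prompt with the existing template, then stream token
        # chunks straight from the LLM as they arrive from TGI.
        formatted = prompt_template.format(prompt=question)
        async for chunk in llm.astream(formatted):
            yield chunk

    return StreamingResponse(generate(), media_type="text/plain")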

Regarding the issue you found in the LangChain repository (openai tools don't work with streaming=True), it seems to be related to a different problem where the streaming=True parameter was not working with OpenAI tools. The issue was fixed in #14613, but it doesn't seem to be directly related to your problem.

I hope this helps! If you have any other questions or if this doesn't solve your problem, please let me know.


PeterTucker (Author)

@dosu-bot @hwchase17 Wondering if it might be an issue similar to this:
https://github.com/langchain-ai/langchain/pull/6211/commits

dosubot bot commented Jan 4, 2024

🤖

I'm unable to access external links or specific pull requests from the LangChain repository. However, if you could provide me with more details or context about the issue you're encountering, I'd be more than happy to help troubleshoot and provide guidance.



PeterTucker (Author)

@dosu-bot Your code did not work

@app.get("/chat")
async def chat():

    prompt = {"prompt":"What is the Nickelodeon channel?"}
    # Generate the response using the LLM Chain and stream the output
    async def generate():
        async for text in llm_chain.run(prompt):
            yield text
            

    return StreamingResponse(generate(), media_type="text/plain")

This is the error I received:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/starlette/responses.py", line 255, in wrap
    await func()
  File "/usr/local/lib/python3.10/site-packages/starlette/responses.py", line 244, in stream_response
    async for chunk in self.body_iterator:
  File "/app/fastapi_server.py", line 78, in generate
    async for text in llm_chain.run(prompt):
TypeError: 'async for' requires an object with __aiter__ method, got str

dosubot bot commented Jan 6, 2024

🤖

Based on the information you've provided and the context from the LangChain repository, it seems like the issue might be due to the return type of the run method in the llm_chain object.

From the context, it appears that the run method is not explicitly defined in the LLMChain class. However, the class has apply and aapply methods which are synchronous and asynchronous respectively. Both methods return a list of dictionaries (List[Dict[str, str]]).

If the run method you're referring to is similar to these methods, then it's returning a list of dictionaries, not an asynchronous generator. This would explain the error message you're seeing: 'async for' requires an object with __aiter__ method, got str.

To resolve this issue, you would need to ensure that the run method returns an asynchronous generator when streaming=True. This could be achieved by modifying the run method to yield results one at a time when streaming=True, rather than returning a list of results all at once.

Here's an example of how you might modify the run method to return an asynchronous generator:

async def run(self, prompt, streaming=False):
    if streaming:
        # If streaming, yield results one at a time
        for result in self._run(prompt):
            yield result
    else:
        # If not streaming, yield the full result as a single chunk
        # (an async generator cannot use `return` with a value)
        yield self._run(prompt)

In this example, _run is a hypothetical method that implements the actual logic of the run method. You would need to replace this with the actual method or logic that you're using to generate results.

Please note that this is just a suggestion based on the provided context and the actual solution might differ based on the exact implementation of the run method and the llm_chain object.

Regarding the pull request you mentioned, without more context about the changes made in that pull request, it's hard to say whether it's related to the issue you're experiencing. If the pull request modifies the run method or the HuggingFaceTextGenInference class in a way that affects their return types or their behavior when streaming=True, then it could potentially be related.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.


PeterTucker (Author)

Just to close this: it was a mistake on my end. The URL should have had '_stream' on the end. Wompidy womp.

ai_url = "http://tgi-ai-server:" + str(os.getenv("AI_PORT", 80)) + "/generate_stream"

not

ai_url = "http://tgi-ai-server:" + str(os.getenv("AI_PORT", 80)) + "/generate"
