Replies: 1 comment
🤖 Hello,

Based on the information you've provided and the context from the LangChain repository, it seems like you're running into rate-limit issues when processing large amounts of text with the ada-embeddings-02 model. A similar issue was reported in the LangChain repository under the title "Langchain QA over large documents results in Rate limit errors". However, LangChain does not appear to have a built-in mechanism for handling rate limits when making requests to OpenAI.

To handle this, you can implement a backoff mechanism in your code so you stop short of the OpenAI token limit, for example by modifying the `_generate` and `_agenerate` methods of the OpenAI LLM class. Here's how you could modify `_generate`:

```python
def _generate(
    self,
    prompts: List[str],
    stop: Optional[List[str]] = None,
    run_manager: Optional[CallbackManagerForLLMRun] = None,
    **kwargs: Any,
) -> LLMResult:
    params = self._invocation_params
    params = {**params, **kwargs}
    sub_prompts = self.get_sub_prompts(params, prompts, stop)
    choices = []
    token_usage: Dict[str, int] = {}
    _keys = {"completion_tokens", "prompt_tokens", "total_tokens"}
    for _prompts in sub_prompts:
        if self.streaming:
            if len(_prompts) > 1:
                raise ValueError("Cannot stream results with multiple prompts.")
        response = completion_with_retry(
            self, prompt=_prompts, run_manager=run_manager, **params
        )
        choices.extend(response["choices"])
        update_token_usage(_keys, response, token_usage)
        # Check whether the token limit is about to be exceeded
        if token_usage["total_tokens"] > SOME_TOKEN_LIMIT:
            # Pause for some time (placeholder constants; requires `import time`)
            time.sleep(SOME_BACKOFF_TIME)
    return self.create_llm_result(choices, prompts, token_usage)
```

And here's how you could modify `_agenerate`:

```python
async def _agenerate(
    self,
    prompts: List[str],
    stop: Optional[List[str]] = None,
    run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
    **kwargs: Any,
) -> LLMResult:
    params = self._invocation_params
    params = {**params, **kwargs}
    sub_prompts = self.get_sub_prompts(params, prompts, stop)
    choices = []
    token_usage: Dict[str, int] = {}
    _keys = {"completion_tokens", "prompt_tokens", "total_tokens"}
    for _prompts in sub_prompts:
        if self.streaming:
            if len(_prompts) > 1:
                raise ValueError("Cannot stream results with multiple prompts.")
        response = await acompletion_with_retry(
            self, prompt=_prompts, run_manager=run_manager, **params
        )
        choices.extend(response["choices"])
        update_token_usage(_keys, response, token_usage)
        # Check whether the token limit is about to be exceeded
        if token_usage["total_tokens"] > SOME_TOKEN_LIMIT:
            # Pause without blocking the event loop (requires `import asyncio`)
            await asyncio.sleep(SOME_BACKOFF_TIME)
    return self.create_llm_result(choices, prompts, token_usage)
```

In both methods, the running `token_usage` total is checked after every batch of sub-prompts, and the loop pauses for `SOME_BACKOFF_TIME` seconds once the total passes `SOME_TOKEN_LIMIT`; both are placeholders you would tune to your quota. Please note that these modifications would need to be made in LangChain's OpenAI LLM implementation itself (or in a subclass of it), since `_generate` and `_agenerate` are internal methods. A lighter-weight alternative that avoids patching LangChain internals is sketched just below.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

Best,
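For the embeddings path specifically, that alternative could look like the following sketch: batch the documents yourself, pause between batches, and retry with exponential backoff when a request is rejected. This assumes the `tenacity` package; `BATCH_SIZE`, `PAUSE_BETWEEN_BATCHES`, and the wait bounds are illustrative placeholders, not LangChain defaults.

```python
import time

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from tenacity import retry, stop_after_attempt, wait_exponential

BATCH_SIZE = 100            # documents sent per burst; tune to your tokens-per-minute quota
PAUSE_BETWEEN_BATCHES = 10  # seconds to wait between bursts

@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(6))
def _add_batch(vectorstore, batch):
    # If the embedding request still hits a rate limit, the exception propagates
    # here and tenacity retries it with exponential backoff.
    vectorstore.add_documents(batch)

def index_documents(docs, persist_directory="db"):
    vectorstore = Chroma(
        embedding_function=OpenAIEmbeddings(),
        persist_directory=persist_directory,
    )
    for i in range(0, len(docs), BATCH_SIZE):
        _add_batch(vectorstore, docs[i : i + BATCH_SIZE])
        time.sleep(PAUSE_BETWEEN_BATCHES)  # spread requests out to stay under the quota
    return vectorstore
```

Spacing the batches out keeps you under the tokens-per-minute quota, while the retry decorator only fires when a request is actually rejected, so blind retries are far less likely to exhaust the daily request limit.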
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
Hi everyone,
I've been working with langchain for a short time, but recently, when trying to vectorize and store multiple documents in chromadb, I've hit the token-per-minute limit for the ada-embeddings-02 model. So, I tried implementing a simple backoff for this (even tried a solution from this discussion), but I ended up exceeding the daily request limit as well. I'm wondering if langchain has already introduced a class or method to handle this. If not, I'd appreciate any assistance from someone who's faced this issue and managed to resolve it.
Initially, in my project, users can upload several PDFs at once. So, if one exceeds the 300,000 tokens-per-minute limit, it's obvious the others won't get processed.
I based my code on a tutorial, so there might be some remnants from it. Below is a sketch of the section where I process a file, clean it, extract its text and pages, and store it using the Chroma.from_documents() method:
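(Simplified sketch; `_process_document`, `_clean_text`, and `_text_to_docs` are the helpers I describe below, but the exact parameters and the `store_file` wrapper shown here are only illustrative.)

```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def store_file(uploaded_file, persist_directory="db"):
    # Extract per-page text plus document-level metadata from the uploaded PDF.
    pages, metadata = _process_document(uploaded_file)
    pages = [(page_number, _clean_text(text)) for page_number, text in pages]

    # Turn the cleaned pages into chunked Document objects.
    docs = _text_to_docs(pages, metadata)

    # Embed every chunk and persist it to Chroma in a single call;
    # this is the step that runs into the tokens-per-minute limit.
    return Chroma.from_documents(
        docs, OpenAIEmbeddings(), persist_directory=persist_directory
    )
```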
So, I believe the other important method is `_text_to_docs`. This is where the text is turned into chunks, sketched roughly below:
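(I'm showing `RecursiveCharacterTextSplitter` with placeholder chunk sizes here for illustration; the real values differ.)

```python
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def _text_to_docs(pages, metadata):
    """Split each cleaned page into overlapping chunks, one Document per chunk."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = []
    for page_number, text in pages:
        for i, chunk in enumerate(splitter.split_text(text)):
            docs.append(
                Document(
                    page_content=chunk,
                    metadata={**metadata, "page": page_number, "chunk": i},
                )
            )
    return docs
```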
_process_document extracts metadata, and _clean_text, as its name suggests, cleans text using a loop.
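In sketch form, assuming `pypdf` for the PDF reading (my real versions differ in the details):

```python
import re

from pypdf import PdfReader

def _process_document(uploaded_file):
    """Return ([(page_number, raw_text), ...], metadata) for an uploaded PDF."""
    reader = PdfReader(uploaded_file)
    pages = [(i + 1, page.extract_text() or "") for i, page in enumerate(reader.pages)]
    metadata = {"source": getattr(uploaded_file, "name", "uploaded.pdf")}
    return pages, metadata

def _clean_text(text):
    """Normalise whitespace line by line and drop empty lines."""
    cleaned_lines = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            cleaned_lines.append(line)
    return " ".join(cleaned_lines)
```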
I believe that covers all the important code. Does anyone have a solution for this?
Thanks in advance!