Replies: 1 comment
Answer generated by a 🤖

Answer

I understand that you're encountering a CUDA out of memory error when trying to process a large document using the 'meta-llama/Llama-2-7b-chat-hf' model on a GPU with limited memory. This error arises because the model is attempting to allocate more memory than is available on your GPU. Here's how we can potentially address this issue:
Here's how you can enable gradient checkpointing:

```python
import transformers

# model_id and hf_auth are assumed to be defined in your own code, e.g.
# model_id = "meta-llama/Llama-2-7b-chat-hf" and hf_auth = your HF access token
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth,
    gradient_checkpointing=True,  # Enable gradient checkpointing
)
```

Please note that while gradient checkpointing can help reduce memory usage, it can significantly increase computation time, so it is a trade-off between memory usage and computation time. This is one potential solution to your problem. If you believe this is a bug that could impact other users and you have a fix, we encourage you to make a pull request. If this solution doesn't align with your experience, it might be worth upgrading to the latest version of the software.

Sources

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
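For completeness, here is a minimal sketch of how such a config might be passed when loading the model; the `hf_auth` token value, dtype, and device settings are illustrative assumptions, not taken from the original post:

```python
import torch
import transformers

model_id = "meta-llama/Llama-2-7b-chat-hf"  # model mentioned in the question
hf_auth = "<your_hf_access_token>"          # placeholder Hugging Face token

# Config with gradient checkpointing enabled, as suggested above
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth,
    gradient_checkpointing=True,
)

# Load the model with that config; half precision and automatic device
# placement are common memory-saving choices on an A100
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    torch_dtype=torch.float16,
    device_map="auto",
    use_auth_token=hf_auth,
)
```

Note that gradient checkpointing trades extra computation for lower activation memory during the backward pass, which is why the answer above flags the compute/memory trade-off.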
I am trying to set up retrieval-augmented generation. I can get all of my documents in the correct format and pass them to the prompt, but for a document that is XX MBs I get a CUDA out-of-memory error on an A100 in Colab.
For a 122.05 KB document, I get the same error.
Code Below
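Purely as an illustration of the kind of setup described above (not the poster's actual code), a minimal sketch might look like the following; the document path, question text, and generation parameters are all assumptions:

```python
import torch
import transformers

model_id = "meta-llama/Llama-2-7b-chat-hf"
hf_auth = "<your_hf_access_token>"  # placeholder Hugging Face token

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    use_auth_token=hf_auth,
)

# Read the retrieved document text (placeholder path)
with open("retrieved_document.txt") as f:
    context = f.read()

# Build a simple retrieval-augmented prompt: retrieved context plus the question
question = "What does the document say about the topic?"  # placeholder question
prompt = (
    "Use the following context to answer the question.\n\n"
    f"{context}\n\nQuestion: {question}\nAnswer:"
)

# The longer the document, the longer the tokenized input; memory use grows
# with sequence length, which is what leads to the CUDA OOM described above
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```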