Error in setting up LangChain custom LLM pipeline: "CUDA out of memory" in PyTorch

Posted by 2jcobegt on 2023-10-20 in Other


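I am currently setting up a custom LLM pipeline for LangChain by following its documentation, using "upstage/llama-65b-instruct" as the LLM. When I tried Upstage's example code, I made sure my GPU had enough capacity to load and run this particular LLM.
However, when I try to follow LangChain's custom LLM instructions, I get an error: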

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.64 GiB total capacity; 22.79 GiB already allocated; 42.50 MiB free; 23.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Here is my setup:

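from langchain.llms.base import LLM
from llama_index import LLMPredictor  # assumed origin of LLMPredictor (llama_index < 0.10 import path)
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch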

class CustomLLM(LLM):
    model_name = "upstage/llama-65b-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch.float16,
        load_in_8bit=True,
        rope_scaling={"type": "dynamic", "factor": 2}  # allows handling of longer inputs
    )

    def _call(self, prompt, stop=None, **kwargs):
        prompt = f"### User:\n{prompt}\n\n### Assistant:\n"
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
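        # Cap the generation length with a finite integer; 512 is an arbitrary example value.
        # An unbounded max_new_tokens lets the KV cache keep growing during generation.
        output = self.model.generate(**inputs, streamer=self.streamer, use_cache=True, max_new_tokens=512)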
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

    @property
    def _identifying_params(self):
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self):
        return "custom"

llm_predictor = LLMPredictor(llm=CustomLLM())

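I would like to know whether there is anything wrong with my current setup, or whether any additional steps or configuration are needed to resolve this. Any insight or guidance would be much appreciated.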

pytorch

Source: https://stackoverflow.com/questions/76897595/error-in-setting-up-langchain-custom-llm-pipeline-cuda-out-of-memory-in-pyto

1 Answer


e5nqia271#

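For a Llama 2 70B model you need at least 40 GB of VRAM. You only have 24 GB: 23.64 GiB total capacity. The same reasoning applies to the 65B model in the question: at 8 bits (one byte) per parameter, its weights alone take roughly 65 GB.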
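The error message also hints at setting max_split_size_mb, which only helps when reserved memory is much larger than allocated memory. A minimal sketch for checking that, assuming the message wording shown in the question (other PyTorch versions phrase it differently); `parse_oom` is a hypothetical helper, not a PyTorch API:

```python
import re

# Error text copied from the question.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.64 GiB total "
    "capacity; 22.79 GiB already allocated; 42.50 MiB free; 23.06 GiB reserved "
    "in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def parse_oom(message: str) -> dict:
    """Extract the memory figures from the OOM message, converted to GiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    figures = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            figures[name] = float(value) * UNIT_TO_GIB[unit]
    return figures


figures = parse_oom(OOM_MESSAGE)
# Memory PyTorch has reserved but is not using; a large gap would suggest fragmentation.
slack = figures["reserved"] - figures["allocated"]
print(f"allocated: {figures['allocated']:.2f} of {figures['total']:.2f} GiB")
print(f"reserved but unused: {slack:.2f} GiB")
```

Here the gap between reserved and allocated memory is only about 0.27 GiB, so fragmentation is not the issue: the card is simply full.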
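Some options:

- (Always recommended) Use a quantized model (that is, one whose size is reduced without losing much quality). TheBloke's Hugging Face page has hundreds of such models that you can use for free; for example, TheBloke/Llama-2-70B-chat-GPTQ takes only 35 GB, whereas the unquantized 70B model takes 135 GB.
- Use a smaller model (in terms of parameters).
- Rent a GPU with more memory.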
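The sizes quoted above can be roughly reproduced by multiplying the parameter count by the bytes stored per parameter. This is a back-of-the-envelope sketch: the function name is hypothetical, the parameter counts are the nominal 65B and 70B, and the estimate covers weights only (no activations, KV cache, or quantization metadata).

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


models = [("65B (model in the question)", 65e9), ("70B (Llama 2)", 70e9)]
for name, n_params in models:
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gb(n_params, precision)
        print(f"{name:28s} {precision:8s} ~{size:.1f} GB")
```

Under these assumptions a 70B model needs about 140 GB in float16 and about 35 GB at 4 bits, close to the figures in the list, and none of the variants fits in 24 GB.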

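If you want to use the largest models, rent a GPU (I like vast.ai) and pick a quantized version. As of October 2023, renting an A40 GPU with 48 GB of VRAM costs $0.40/hour, and it can easily run Llama 2 70B.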
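Loading a pre-quantized checkpoint on such a GPU might look like the sketch below. It assumes a transformers release with GPTQ support (with the optimum and auto-gptq packages installed) and enough VRAM for the 35 GB checkpoint; `load_quantized` and `generate` are hypothetical helper names, and only the repository name comes from the answer above.

```python
# Repository name taken from the answer; everything else is illustrative.
QUANTIZED_MODEL = "TheBloke/Llama-2-70B-chat-GPTQ"


def load_quantized(model_name: str = QUANTIZED_MODEL):
    """Load a pre-quantized GPTQ checkpoint together with its tokenizer."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # The checkpoint already stores quantized weights, so load_in_8bit is not
    # passed here; device_map="auto" lets accelerate place the layers.
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a bounded-length completion for an already formatted prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)


# Example usage (downloads the checkpoint, so it is left commented out):
# tokenizer, model = load_quantized()
# print(generate(tokenizer, model, "What is quantization?"))
```

The same two helpers could be called from the `_call` method of the custom LLM class in the question. Formatting the prompt is left to the caller, since the chat model may expect a different template from the "### User:" one used for the Upstage model.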

2023-10-20
