PyTorch error "RuntimeError: CUDA out of memory", yet plenty of free memory is reported

Asked by p8h8hvxi on 2022-12-13 · Other

While trying to run a Stable Diffusion model through InvokeAI (a GUI similar to AUTOMATIC1111's), I get the following PyTorch/CUDA error at seemingly random points during GPU-intensive operations such as image generation or loading another model:

>> Model change requested: stable-diffusion-1.5
>> Current VRAM usage:  0.00G
>> Offloading custom-elysium-anime-v2 to CPU
>> Scanning Model: stable-diffusion-1.5
>> Model Scanned. OK!!
>> Loading stable-diffusion-1.5 from G:\invokeAI\bullshit\models\ldm\stable-diffusion-v1\v1-5-pruned-emaonly.ckpt
** model stable-diffusion-1.5 could not be loaded: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 13107200 bytes.
Traceback (most recent call last):
  File "g:\invokeai\ldm\invoke\model_cache.py", line 80, in get_model
    requested_model, width, height, hash = self._load_model(model_name)
  File "g:\invokeai\ldm\invoke\model_cache.py", line 228, in _load_model
    sd = torch.load(io.BytesIO(weight_bytes), map_location='cpu')
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\serialization.py", line 1049, in _load
    result = unpickler.load()
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\serialization.py", line 1019, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\serialization.py", line 997, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch._UntypedStorage).storage()._untyped()
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 13107200 bytes.

** restoring custom-elysium-anime-v2
>> Retrieving model custom-elysium-anime-v2 from system RAM cache

Traceback (most recent call last):
  File "g:\invokeai\ldm\invoke\model_cache.py", line 80, in get_model
    requested_model, width, height, hash = self._load_model(model_name)
  File "g:\invokeai\ldm\invoke\model_cache.py", line 228, in _load_model
    sd = torch.load(io.BytesIO(weight_bytes), map_location='cpu')
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\serialization.py", line 1049, in _load
    result = unpickler.load()
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\serialization.py", line 1019, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\serialization.py", line 997, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch._UntypedStorage).storage()._untyped()
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 13107200 bytes.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "g:\invokeai\backend\invoke_ai_web_server.py", line 301, in handle_set_model
    model = self.generate.set_model(model_name)
  File "g:\invokeai\ldm\generate.py", line 843, in set_model
    model_data = cache.get_model(model_name)
  File "g:\invokeai\ldm\invoke\model_cache.py", line 93, in get_model
    self.get_model(self.current_model)
  File "g:\invokeai\ldm\invoke\model_cache.py", line 73, in get_model
    self.models[model_name]['model'] = self._model_from_cpu(requested_model)
  File "g:\invokeai\ldm\invoke\model_cache.py", line 371, in _model_from_cpu
    model.to(self.device)
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\pytorch_lightning\core\mixins\device_dtype_mixin.py", line 113, in to
    return super().to(*args, **kwargs)
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 927, in to
    return self._apply(convert)
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 602, in _apply
    param_applied = fn(param)
  File "G:\invokeAI\installer_files\env\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; 802.50 KiB already allocated; 6.10 GiB free; 2.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The line that stands out is:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; 802.50 KiB already allocated; 6.10 GiB free; 2.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Normally I would assume this simply means there is not enough VRAM, as the message suggests. However, it is trying to allocate only 20 MiB while at the same time reporting 6.10 GiB free, which is why the error makes no sense to me.
This is running on an RTX 2080 with driver version 527.56.
I have not been able to find any solution that applies to my situation. I would expect the allocation to succeed, since far more memory is reported free than the amount requested, yet this error occurs.
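The error text itself suggests tuning `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of setting it before launching InvokeAI follows; the value 512 is an illustrative guess, not a recommendation from the InvokeAI project:

```shell
# Cap the size of splittable blocks in PyTorch's CUDA caching allocator
# to reduce fragmentation (value in MiB; 512 is an illustrative guess).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# On Windows (cmd.exe) the equivalent would be:
#   set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# then launch InvokeAI as usual from the same shell.
```

The variable must be set in the environment of the process that imports torch, so it has to be exported before InvokeAI starts, not afterwards.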

Answer from yhxst69z:

The highlighted line is a consequence of another exception raised earlier:

RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 13107200 bytes.

Judging from your stack trace, the problem occurs while loading the model checkpoint, i.e. in the `DefaultCPUAllocator` on the CPU side, not on the GPU. Could it be that you have installed a newer PyTorch version than the one Invoke AI was built against? Also, they state that you need at least 12 GB of RAM — is that the case?
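The two checks suggested above can be sketched in a few lines of Python. Note how small the failed CPU allocation actually is, which points at exhausted system RAM or pagefile rather than VRAM. The helper names are illustrative, not part of InvokeAI; `psutil` is an optional third-party package:

```python
def mib(n_bytes: int) -> float:
    """Convert a byte count to MiB."""
    return n_bytes / 2**20

# The CPU allocator failed on a mere 13,107,200-byte request:
print(f"failed CPU allocation: {mib(13107200):.1f} MiB")  # 12.5 MiB

def available_ram_bytes():
    """Best-effort available-RAM check.

    Tries psutil (which also works on Windows), then /proc/meminfo
    (Linux only), and returns None if neither source is available.
    """
    try:
        import psutil  # optional third-party dependency
        return psutil.virtual_memory().available
    except ImportError:
        pass
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) * 1024  # kB -> bytes
    except OSError:
        pass
    return None

ram = available_ram_bytes()
if ram is not None:
    print(f"available system RAM: {ram / 2**30:.1f} GiB")
```

If the available figure is well below the 12 GB that InvokeAI states it needs, increasing the Windows pagefile or closing other applications would be the first thing to try.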
