Starting from the code
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
it does not work (on an A100 with Python 3.10 and CUDA 12.1):
ImportError: torch_extensions/py310_cu121/ragged_device_ops/ragged_device_ops.so: cannot open shared object file: No such file or directory
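For reference, a self-contained version of the repro (the import, prompt, and max_new_tokens value are illustrative additions, not from the original report):
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
# The ImportError above is raised while the pipeline is being constructed,
# before any generation happens.
response = pipe(["DeepSpeed is"], max_new_tokens=64)
print(response)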
5 answers
ljsrvy3e 1#
Hi @mohbay - can you share your ds_report? My guess is that you don't have deepspeed-kernels / the cutlass kernels installed for those ops.
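A quick way to check both of those from Python (an editorial sketch, not part of the original reply; it only uses the standard library):
import importlib.metadata
import os

# Is the deepspeed-kernels wheel installed at all?
try:
    print("deepspeed-kernels:", importlib.metadata.version("deepspeed-kernels"))
except importlib.metadata.PackageNotFoundError:
    print("deepspeed-kernels: not installed")

# Is a local CUTLASS checkout configured as an alternative?
print("CUTLASS_PATH:", os.environ.get("CUTLASS_PATH", "<not set>"))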
1mrurvl1 2#
Hi @loadams. This may indeed be related to cutlass. Thank you very much. Here is the ds_report output:
[INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meets the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja ............... [OKAY]
op name ................ installed .. compatible
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fp_quantizer ........... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference . [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
sparse_attn ......... [NO] ....... [NO]
spatial_inference ... [NO] ....... [OKAY]
transformer ......... [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
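One note on this report: the $CUTLASS_PATH warning is printed right above evoformer_attn, one of the two ops flagged as incompatible (the other being sparse_attn), so it may be unrelated to the ragged_device_ops failure. If you do want to point DeepSpeed at a local CUTLASS checkout anyway, set the variable before importing, roughly like this (the path is a placeholder):
import os
os.environ["CUTLASS_PATH"] = "/path/to/cutlass"  # placeholder: local CUTLASS checkout

import mii  # import only after the variable is set so the op builders can see it
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")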
pdkcd3nj 3#
Can you share your pip list as well? Or have you installed deepspeed-kernels?
emeijp43 4#
deepspeed-kernels is installed. Here is the pip list.
Package Version
aniso8601 9.0.1
annotated-types 0.7.0
asyncio 3.4.3
blinker 1.8.2
certifi 2024.7.4
charset-normalizer 3.3.2
click 8.1.7
cmake 3.30.0
deepspeed 0.14.4
deepspeed-kernels 0.0.1.dev1698255861
deepspeed-mii 0.2.3
filelock 3.15.4
Flask 3.0.3
Flask-RESTful 0.3.10
fsspec 2024.6.1
grpcio 1.64.1
grpcio-tools 1.64.1
hjson 3.1.0
huggingface-hub 0.23.5
idna 3.7
itsdangerous 2.2.0
Jinja2 3.1.4
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.3
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.555.43
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.5.82
nvidia-nvtx-cu12 12.1.105
packaging 24.1
pillow 10.4.0
pip 22.0.2
protobuf 5.27.2
psutil 6.0.0
py-cpuinfo 9.0.0
pydantic 2.8.2
pydantic_core 2.20.1
pynvml 11.5.2
pytz 2024.1
PyYAML 6.0.1
pyzmq 26.0.3
regex 2024.5.15
requests 2.32.3
safetensors 0.4.3
setuptools 59.6.0
six 1.16.0
sympy 1.13.0
tokenizers 0.19.1
torch 2.3.1
tqdm 4.66.4
transformers 4.41.2
triton 2.3.1
typing_extensions 4.12.2
ujson 5.10.0
urllib3 2.2.2
Werkzeug 3.0.3
zmq 0.0.0
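The versions here look consistent with each other, so one more thing worth checking (an editorial sketch, not from the thread) is whether the JIT build ever produced the shared object the ImportError names. torch puts JIT-built extensions under $TORCH_EXTENSIONS_DIR, or ~/.cache/torch_extensions by default, which matches the py310_cu121 path in the error:
import os
from pathlib import Path

# Default cache used by torch.utils.cpp_extension for JIT-built ops.
root = Path(os.environ.get("TORCH_EXTENSIONS_DIR",
                           Path.home() / ".cache" / "torch_extensions"))
build_dir = root / "py310_cu121" / "ragged_device_ops"

print("build dir exists:", build_dir.is_dir())
if build_dir.is_dir():
    for f in sorted(build_dir.iterdir()):
        # A build.ninja with no .so next to it usually means the compile failed.
        print(" ", f.name)
If the directory exists but the .so is missing, the compiler output from that failed build is the real error to chase.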
r6vfmomb 5#
Please try the following steps to resolve the issue:
1. Make sure all required dependencies are installed correctly, such as ninja, clang, libtorch, libc++, etc.
2. Check the source code of the ragged_device_ops extension and make sure there are no syntax errors or other problems.
3. Try recompiling the ragged_device_ops extension with a different compiler or build tool, such as gcc or clang (see the sketch after this list).
4. If the problem persists, search GitHub for related issues or commits to see whether others have hit a similar problem and shared a solution.
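Following the recompile suggestion above, a minimal sketch (the paths and compiler choice are placeholders): clear the partial build directory so the op is rebuilt on the next run, and pin the C++ compiler via CXX, which torch's ninja-based JIT build reads.
import os
import shutil
from pathlib import Path

# Remove any partial build so the op is JIT-compiled again from scratch.
build_dir = Path.home() / ".cache" / "torch_extensions" / "py310_cu121" / "ragged_device_ops"
shutil.rmtree(build_dir, ignore_errors=True)

os.environ["CXX"] = "/usr/bin/g++"  # placeholder; swap in clang++ to try a different compiler

import mii  # constructing the pipeline re-triggers the JIT build
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")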