I am building an inference service with torch, gunicorn, and flask that should use CUDA. To reduce resource usage, I use gunicorn's preload option so that the model is shared between the worker processes. However, this leads to a problem with CUDA. The following snippet shows a minimal reproducible example:
from flask import Flask, request
import torch

app = Flask('dummy')

# Loaded once at import time; with gunicorn --preload this happens in the
# parent process, before the workers are forked.
model = torch.rand(500)
model = model.to('cuda:0')

@app.route('/', methods=['POST'])
def f():
    data = request.get_json()
    # Creating a CUDA tensor inside the forked worker triggers the error below.
    x = torch.rand((data['number'], 500))
    x = x.to('cuda:0')
    res = x * model
    return {
        "result": res.sum().item()
    }
Starting the server with

CUDA_VISIBLE_DEVICES=1 gunicorn -w 3 -b $HOST_IP:8080 --preload run_server:app

works: the service comes up successfully. However, after the first request is issued (curl -X POST -d '{"number": 1}'), the worker raises the following error:
[2022-06-28 09:42:00,378] ERROR in app: Exception on / [POST]
Traceback (most recent call last):
File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/home/user/.local/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/home/user/project/run_server.py", line 14, in f
x = x.to('cuda:0')
File "/home/user/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 195, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
I load the model in the parent process, and each forked worker process can access it. The problem arises when a CUDA-backed tensor is created in a worker process: this re-initializes the CUDA context in the worker, which fails because it was already initialized in the parent process. If we instead set x = data['number'] and delete the line x = x.to('cuda:0'), the inference succeeds.
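For illustration, a sketch of that working variant of the handler (same app and model as in the snippet above; the worker never allocates a CUDA tensor itself):

@app.route('/', methods=['POST'])
def f():
    data = request.get_json()
    # x stays a plain Python number; the only CUDA tensor involved is the
    # pre-loaded `model`, so no new CUDA context is created in the worker.
    x = data['number']
    res = x * model
    return {
        "result": res.sum().item()
    }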
Adding torch.multiprocessing.set_start_method('spawn') or multiprocessing.set_start_method('spawn') does not change anything, presumably because gunicorn always uses fork when started with the --preload option.
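For reference, this is the kind of call that was attempted (a sketch; it has no effect here because gunicorn forks its workers via os.fork() directly and never consults the multiprocessing start method):

import torch.multiprocessing

# Called at module level in run_server.py, before the model is loaded.
# gunicorn's worker forking ignores this setting.
torch.multiprocessing.set_start_method('spawn', force=True)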
One solution would be to drop the --preload option, but that leads to multiple copies of the model in memory and on the GPU, which is exactly what I am trying to avoid.
Is it possible to solve this problem *without* loading the model separately in every worker process?
1 Answer
You can use gevent instead of gunicorn; that is how I solved this problem.
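A minimal sketch of what that could look like, assuming the Flask app from the question lives in run_server.py (the bind address is illustrative): gevent's WSGIServer serves the app from a single process, so no fork happens after CUDA is initialized and the model exists in exactly one copy on the GPU:

from gevent.pywsgi import WSGIServer

from run_server import app  # the Flask app from the question

# A single process handles all requests via greenlets, so the CUDA
# context is created exactly once and is never inherited by a fork.
http_server = WSGIServer(('0.0.0.0', 8080), app)
http_server.serve_forever()

Note that a single process also means GPU-bound handlers run one at a time; the trade-off versus multiple forked workers is a single CUDA context and a single copy of the model.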