System Info
Text Generation Inference: v2.1.0+
Driver version: 535.161.08, CUDA version: 12.2
GPU: DGX with 8x H100 80GB
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
I am running TGI with Docker on a DGX with 8x H100:

docker run --restart=on-failure --env LOG_LEVEL=INFO --gpus all --ipc=host -p 8080:8080 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard 8 --port 8080 --max-input-length 34000 --max-total-tokens 32000 --max-batch-prefill-tokens 128000

Everything runs, but I regularly hit crashes during inference. This happens with several models, but most often with WizardLM 8x22B. At first I thought it was related to cuda-graphs, but I now believe that was a red herring. Increasing max-batch-prefill-tokens seems to reduce how often the error shows up.

I suspect this may be the same problem as #1566?
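To get more detail on the underlying assert, one option (a sketch only, reusing the exact flags from the command above) is to re-run with the debug environment variables that the traceback itself suggests; note that CUDA_LAUNCH_BLOCKING=1 slows inference down considerably and is only meant for debugging:

docker run --restart=on-failure --env LOG_LEVEL=INFO --env NCCL_DEBUG=INFO --env CUDA_LAUNCH_BLOCKING=1 --gpus all --ipc=host -p 8080:8080 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard 8 --port 8080 --max-input-length 34000 --max-total-tokens 32000 --max-batch-prefill-tokens 128000

With NCCL_DEBUG=INFO the NCCL layer logs its communicator setup and errors in more detail, and with CUDA_LAUNCH_BLOCKING=1 the Python traceback points at the kernel launch that actually failed rather than at a later, unrelated call.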
2024-06-26T07:35:08.486443Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1712608935911/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2395, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'device-side assert triggered'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 91, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 261, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 146, in Prefill
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1094, in generate_token
out, speculative_logits = self.forward(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1047, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 651, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 583, in forward
hidden_states = self.embed_tokens(input_ids)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 233, in forward
torch.distributed.all_reduce(out, group=self.process_group)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 77, in wrapper
msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 50, in _get_msg_dict
"args": f"{args}, {kwargs}",
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 464, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 331, in _tensor_str
self = self.float()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-06-26T07:35:08.486444Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1712608935911/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2395, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'device-side assert triggered'
During handling of the above exception, another exception occurred:
...
Expected behavior
It should be able to prefill without errors as long as the environment supports the configured maximum batch sizes.
4 Answers
4xrmg8kj1#
Sometimes this also seems to cause the server to hang indefinitely. I get a generation debug entry, but nothing further happens:
From what I can tell, the last output before the server gets stuck is:
Then nothing, apart from further requests coming in as described above.
Just before that, I get a very large block allocation:
Allocation: BlockAllocation { blocks: [9100, [...], 177598], block_allocator: BlockAllocator { block_allocator: UnboundedSender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f77f0004800, tail_position: 73 }, semaphore: Semaphore(0), rx_waker: AtomicWaker, tx_count: 2, rx_fields: "..." } } } } }
Sorry if this isn't relevant; I just wanted to provide every piece of information that stood out to me.
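If someone hits the same hang, one cheap check (a sketch, assuming the 8080 port mapping from the command above) is to poll the router's health route and see whether it still answers at all:

curl -sf http://localhost:8080/health && echo healthy || echo "no response"

If /health stops responding too, that points at the whole shard group being stuck in the failed collective rather than at one slow request.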
qij5mzcb2#
Same here.
Upgraded from v2.0.1 to v2.1.0.
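Since the regression seems tied to the v2.0.1 to v2.1.0 upgrade, one interim workaround (a sketch, assuming the registry carries per-version image tags) is to pin the image to the last known-good release instead of :latest:

docker run --restart=on-failure --env LOG_LEVEL=INFO --gpus all --ipc=host -p 8080:8080 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.1 --model-id $model --num-shard 8 --port 8080 --max-input-length 34000 --max-total-tokens 32000 --max-batch-prefill-tokens 128000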
1yjd4xko3#
I ran into a similar problem: after upgrading to v2.1.0, multi-GPU support no longer seems to work. Once I disabled sharding, the problem went away.
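For anyone wanting to try the same mitigation, a minimal sketch (assuming a model small enough to fit on a single GPU, and keeping the other flags from the original report) is to drop the shard count to one:

docker run --restart=on-failure --env LOG_LEVEL=INFO --gpus all --ipc=host -p 8080:8080 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard 1 --port 8080 --max-input-length 34000 --max-total-tokens 32000 --max-batch-prefill-tokens 128000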
flmtquvp4#
When I load the model with Docker on a single GPU, it needs 11250GB of GPU memory. If I use 2 shards, the memory demand on each of the two GPUs is about the same, i.e. in total twice that of a single shard.
Sharding should split my model across the GPUs so that each one holds roughly half of it (for 2 GPUs).
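One way to sanity-check how the weights actually get split (a sketch, assuming nvidia-smi is available on the host) is to watch per-GPU memory while the shards load:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5

Keep in mind that TGI also reserves memory for the KV cache after the weights are loaded, so the per-GPU figure can end up well above half the model size even when the sharding itself works.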
Sharding via the TGI CLI works as expected, but inference takes longer through the CLI, probably because exllama, vllm and the related libraries are not installed there.
Do you have any suggestions?