I am currently running llama2-13b inference on 2×H800 GPUs. Here is my code:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0, 1'
import torch
import mii
from transformers import LlamaTokenizer
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
print(f'world_size: {world_size}, local_rank: {local_rank}')
base_model = '/data/wlx/Llama-2-13b-chat-hf/'
input_text = [' DeepSpeed is a useful tools ' for _ in range(1000)]
input_text = ''.join(input_text)
tokenizer = LlamaTokenizer.from_pretrained(base_model, use_fast=True)
input_ids, _ = tokenizer(input_text, return_tensors='pt').values()
input_ids = input_ids[0]
pipe = mii.pipeline(base_model)
for i in range(1, 4):
    cur_text = input_ids[:i * 512]
    if local_rank == 0:
        print(f'\nToken Shape: {cur_text.shape}\n')
    cur_text = tokenizer.decode(cur_text, skip_special_tokens=True,
                                clean_up_tokenization_spaces=False)
    with torch.no_grad():
        output = pipe(cur_text,
                      max_length=8172,
                      min_new_tokens=1,
                      max_new_tokens=256,
                      ignore_eos=False,
                      do_sample=False,
                      return_full_text=False)
    if local_rank == 0:
        output = output[0]
        print(f'prompt_length: {output.prompt_length}, '
              f'output_length: {output.generated_length}, '
              f'finished_reason: {output.finish_reason}')
To control the prompt length precisely, I decode a specified number of tokens back into text and use that as the input. Typically, a deadlock occurs after the for loop has run 2-3 iterations, as shown in the log below:
[2023-12-29 21:09:48,975] [INFO] [engine_v2.py:84:__init__] Model built.
[2023-12-29 21:09:52,020] [INFO] [engine_v2.py:84:__init__] Model built.
[2023-12-29 21:09:55,399] [INFO] [kv_cache.py:135:__init__] Allocating KV-cache 0 with shape: (40, 617, 64, 2, 20, 128) consisting of 617 blocks.
[2023-12-29 21:09:55,399] [INFO] [kv_cache.py:135:__init__] Allocating KV-cache 0 with shape: (40, 617, 64, 2, 20, 128) consisting of 617 blocks.
Token Shape: torch.Size([512])
prompt_length: 512, output_length: 256, finished_reason: length
Token Shape: torch.Size([1024])
prompt_length: 1024, output_length: 256, finished_reason: length
Token Shape: torch.Size([1536])
Deadlock detected. Resetting KV cache and recomputing requests. Consider limiting number of concurrent requests or decreasing max lengths of prompts/generations.
[2023-12-29 21:10:42,060] [INFO] [launch.py:347:main] Process 1872027 exits successfully.
[2023-12-29 21:11:26,107] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1872026
[2023-12-29 21:11:26,107] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1872027
[2023-12-29 21:11:26,108] [ERROR] [launch.py:321:sigkill_handler] ['/home/lxwei/miniconda3/envs/longlora/bin/python', '-u', 'inference_test/mii_issue.py', '--local_rank=1'] exits with return code = -15
However, if I feed more tokens in a single pass (the for loop runs only once), it works fine. The log is as follows:
[2023-12-29 21:14:49,428] [INFO] [engine_v2.py:84:__init__] Model built.
[2023-12-29 21:14:50,419] [INFO] [engine_v2.py:84:__init__] Model built.
[2023-12-29 21:14:53,906] [INFO] [kv_cache.py:135:__init__] Allocating KV-cache 0 with shape: (40, 617, 64, 2, 20, 128) consisting of 617 blocks.
[2023-12-29 21:14:53,908] [INFO] [kv_cache.py:135:__init__] Allocating KV-cache 0 with shape: (40, 617, 64, 2, 20, 128) consisting of 617 blocks.
Token Shape: torch.Size([2048])
prompt_length: 2048, output_length: 256, finished_reason: length
[2023-12-29 21:15:20,969] [INFO] [launch.py:347:main] Process 1873408 exits successfully.
[2023-12-29 21:15:20,969] [INFO] [launch.py:347:main] Process 1873409 exits successfully.
1 Answer
Hi @CxsGhost, please read more about the deadlock here and here. Essentially, the problem occurs when there is not enough memory to compute the next token for all of the prompts placed on the inference engine. We are working on a fix to avoid this situation, and I will share an update once a solution is available. Thanks!
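Until a fix lands, one possible workaround (not the library's official solution, just a sketch of what the warning itself suggests) is to shrink the length budget of each request: instead of a fixed max_length=8172, cap the total of prompt plus generation. The snippet below reuses pipe, tokenizer and input_ids from the question code; prompt_len is a name introduced here for illustration only.

# Hypothetical mitigation: request only as much total length as this prompt needs,
# following the warning's advice to decrease max lengths of prompts/generations.
prompt_ids = input_ids[:1536]
prompt_len = prompt_ids.shape[0]
prompt = tokenizer.decode(prompt_ids, skip_special_tokens=True,
                          clean_up_tokenization_spaces=False)
output = pipe(prompt,
              max_length=prompt_len + 256,  # prompt + at most 256 new tokens, instead of 8172
              max_new_tokens=256,
              do_sample=False,
              return_full_text=False)

Whether this actually avoids the deadlock depends on how much KV-cache memory the engine has available; if it still triggers, reducing the number of concurrent requests or the prompt length itself is the other knob the warning mentions.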