Suggestion to improve performance
I have two nodes (4 GPUs each), 8 GPUs in total. One node has four 3090s; the other has three 3090s and one 3080. The 3090s have 24GB of VRAM, while the 3080 has only 12GB. So when I load a large model such as Llama 3 70B, it gets split so that each GPU takes about 16GB, and I hit an OOM error. A somewhat smaller model would also work as an example, as long as its per-GPU share after splitting exceeds 12GB.
I have found a few workarounds and thought it would be interesting to bring this up and see whether something like this could/will be implemented in the future.
Using accelerate
Accelerate has a utility called get_balanced_memory that computes the max_memory dictionary when loading a model into memory for inference. It automatically works out how to split the model when there are multiple GPUs with different amounts of VRAM. The dictionary can also be set manually.
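As a rough sketch of how that could look (assuming recent accelerate and transformers; the model name is just a placeholder borrowed from the prototype further down, and the fp16 dtype is an assumption):

import torch
from accelerate import init_empty_weights
from accelerate.utils import get_balanced_memory, infer_auto_device_map
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton without allocating weights, so the split can be
# planned before anything touches GPU memory.
config = AutoConfig.from_pretrained("unsloth/Phi-3-mini-4k-instruct")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Per-device memory budget that accounts for GPUs with different VRAM sizes.
max_memory = get_balanced_memory(empty_model, dtype=torch.float16)
print(max_memory)

# Turn the budget into a module-to-device mapping.
device_map = infer_auto_device_map(empty_model, max_memory=max_memory, dtype=torch.float16)
print(device_map)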
Getting GPU memory manually
import torch

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_properties(i).total_memory)
A custom device_map can then be used, or one of the approaches below.
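For example, a manually set max_memory can be passed straight to transformers; a minimal sketch, where the GiB budgets are illustrative for three 24GB cards plus one 12GB card (leaving headroom for activations and the KV cache, not measured values) and the model name is again just a placeholder:

from transformers import AutoModelForCausalLM

# Illustrative budgets: leave a few GiB of headroom on each card.
max_memory = {0: "21GiB", 1: "21GiB", 2: "21GiB", 3: "9GiB"}

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Phi-3-mini-4k-instruct",
    device_map="auto",      # let accelerate place modules
    max_memory=max_memory,  # cap what each GPU may receive
    torch_dtype="auto",
)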
Using torch
You can count the layers in the model and split them across GPUs using the list of GPU memory sizes (this is just an MVP, a rough prototype):
import torch
import torch.nn as nn
from transformers import AutoConfig, AutoTokenizer
from transformers.models.mistral.modeling_mistral import MistralForCausalLM

# Step 1: Calculate the memory of each GPU
def get_gpu_memory():
    gpu_memory = []
    num_gpus = torch.cuda.device_count()
    for i in range(num_gpus):
        props = torch.cuda.get_device_properties(i)
        gpu_memory.append(props.total_memory)
    return gpu_memory

# Step 2: Manually split the model layers across GPUs
class DistributedModel(nn.Module):
    def __init__(self, model, gpu_memory):
        super(DistributedModel, self).__init__()
        self.gpu_layers = nn.ModuleList()
        total_memory = sum(gpu_memory)
        proportions = [mem / total_memory for mem in gpu_memory]
        layers = list(model.model.layers.children())
        num_layers = len(layers)
        layers_per_gpu = [int(p * num_layers) for p in proportions]
        # Adjust to make sure the total layers assigned equals num_layers
        diff = num_layers - sum(layers_per_gpu)
        for i in range(diff):
            layers_per_gpu[i % len(layers_per_gpu)] += 1
        # Allocate layers to GPUs
        current_layer = 0
        for i, num in enumerate(layers_per_gpu):
            device = torch.device(f'cuda:{i}')
            gpu_layers = layers[current_layer:current_layer + num]
            self.gpu_layers.append(nn.Sequential(*gpu_layers).to(device))
            current_layer += num
        # Embedding on the first GPU, final norm and LM head on the last one
        self.embedding = model.model.embed_tokens.to(torch.device('cuda:0'))
        self.ln_f = model.model.norm.to(torch.device(f'cuda:{len(gpu_memory) - 1}'))
        self.head = model.lm_head.to(torch.device(f'cuda:{len(gpu_memory) - 1}'))

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        for i, layers in enumerate(self.gpu_layers):
            device = torch.device(f'cuda:{i}')
            x = x.to(device)
            for layer in layers:
                x = layer(x)
                if isinstance(x, tuple):
                    x = x[0]  # Ensure we are working with tensors, not tuples
        x = self.ln_f(x.to(torch.device(f'cuda:{len(self.gpu_layers) - 1}')))
        logits = self.head(x)
        return logits

# Load configuration and tokenizer
model_name = "unsloth/Phi-3-mini-4k-instruct"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize the model (without loading weights); the class should match the
# checkpoint's architecture
model = MistralForCausalLM(config)

# Get GPU memory and split the model
gpu_memory = get_gpu_memory()
distributed_model = DistributedModel(model, gpu_memory)

# Inputs go to the first GPU, where the embedding layer lives
device = torch.device('cuda:0')

# Run inference
input_text = "Your input text here"
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs['input_ids'].to(device)
with torch.no_grad():
    outputs = distributed_model(input_ids)

# Decode the output (per-position argmax, not autoregressive generation)
decoded_output = tokenizer.decode(outputs[0].argmax(dim=-1).tolist(), skip_special_tokens=True)
print(decoded_output)
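As a quick sanity check of the proportional split above, applied to the mixed node described earlier (three 24GB 3090s and one 12GB 3080) and a hypothetical 32-layer model (the layer count is an assumption for illustration only):

# Hypothetical check of the proportional layer split for 3x 24GB + 1x 12GB GPUs.
gpu_memory = [24, 24, 24, 12]  # GiB, illustrative values
num_layers = 32                # hypothetical decoder layer count

total = sum(gpu_memory)
layers_per_gpu = [int(mem / total * num_layers) for mem in gpu_memory]
diff = num_layers - sum(layers_per_gpu)
for i in range(diff):
    layers_per_gpu[i % len(layers_per_gpu)] += 1

print(layers_per_gpu)  # [10, 9, 9, 4] -> the 12GB card gets the fewest layers

The 3080 ends up with roughly half as many layers as each 3090, which is the behaviour the prototype is after.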
I wonder whether something like this will be implemented in the future, especially given how well vLLM scales up from a single GPU.
Performance regression report

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

No response
1 answer
bump