vllm [Performance]: Splitting a model across GPUs with different amounts of VRAM


Proposal to improve performance

I have two nodes (4 and 4), 8 GPUs in total. One node has four 3090s; the other has three 3090s and one 3080. A 3090 has 24 GB of VRAM, while the 3080 only has 12 GB. So when I load a large model such as Llama 3 70B, the model is split so that each GPU needs roughly 16 GB, and I get an OOM error. A somewhat smaller model works as an example too, as long as its share after splitting ends up larger than 12 GB.
I have found a few workarounds and thought it would be interesting to raise them here and see whether something like this could/would be implemented in the future.
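
For context, a minimal sketch of roughly the kind of invocation that runs into this (the checkpoint name and settings are only examples, not taken from my actual setup): with plain tensor parallelism, vLLM gives every rank an equal shard of the weights, so the budget is dictated by the smallest card.

from vllm import LLM

# Example only: 8-way tensor parallelism gives each GPU an equal share of the
# weights (the ~16 GB per GPU mentioned above for a 70B model in fp16),
# so the 12 GB 3080 runs out of memory.
# The multi-node (Ray) launch details are omitted here.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # example checkpoint
    tensor_parallel_size=8,
)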

Using accelerate

Accelerate has a utility called get_balanced_memory that computes the max_memory dictionary when loading a model into memory for inference. It automatically works out how to split the model when there are multiple GPUs with different amounts of VRAM. The dictionary can also be set by hand.
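
As a rough illustration (a minimal sketch rather than vLLM code; the checkpoint name is only an example), the balanced budget can be computed on an empty model and then turned into a device_map with accelerate:

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from accelerate.utils import get_balanced_memory
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-70B-Instruct"  # example checkpoint

# Build the model skeleton on the meta device, without allocating real weights
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# max_memory: a per-GPU budget proportional to each card's available VRAM
max_memory = get_balanced_memory(empty_model, dtype=torch.float16)

# device_map: layer -> device assignment that respects that budget
device_map = infer_auto_device_map(empty_model, max_memory=max_memory, dtype=torch.float16)
print(device_map)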

Getting GPU memory manually

import torch
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_properties(i).total_memory)

A custom device_map can then be used, or one of the accelerate methods, for example:
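
A minimal sketch (the per-GPU limits and checkpoint name below are placeholders): build a max_memory dict from the values printed above, leave some headroom for activations and the KV cache, and pass it to from_pretrained, which uses accelerate under the hood to place the layers:

import torch
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-70B-Instruct"  # example checkpoint

# Placeholder limits: headroom below each card's total VRAM,
# plus optional CPU offload for whatever does not fit on the GPUs
max_memory = {0: "22GiB", 1: "22GiB", 2: "22GiB", 3: "10GiB", "cpu": "64GiB"}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",       # let accelerate place layers according to max_memory
    max_memory=max_memory,
)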

Using torch

You can count the layers in the model and split them according to the GPU memory list (this works, but it is only an MVP prototype):

import torch
import torch.nn as nn
from transformers import AutoConfig, AutoTokenizer
from transformers.models.mistral.modeling_mistral import MistralForCausalLM

# Step 1: Calculate the memory of each GPU
def get_gpu_memory():
    gpu_memory = []
    num_gpus = torch.cuda.device_count()
    for i in range(num_gpus):
        props = torch.cuda.get_device_properties(i)
        gpu_memory.append(props.total_memory)
    return gpu_memory

# Step 2: Manually split the model layers across GPUs
class DistributedModel(nn.Module):
    def __init__(self, model, gpu_memory):
        super(DistributedModel, self).__init__()
        self.gpu_layers = nn.ModuleList()
        total_memory = sum(gpu_memory)
        proportions = [mem / total_memory for mem in gpu_memory]
        
        layers = list(model.model.layers.children())
        num_layers = len(layers)
        layers_per_gpu = [int(p * num_layers) for p in proportions]
        
        # Adjust to make sure the total layers assigned equals num_layers
        diff = num_layers - sum(layers_per_gpu)
        for i in range(diff):
            layers_per_gpu[i % len(layers_per_gpu)] += 1
        
        # Allocate layers to GPUs
        current_layer = 0
        for i, num in enumerate(layers_per_gpu):
            device = torch.device(f'cuda:{i}')
            gpu_layers = layers[current_layer:current_layer + num]
            self.gpu_layers.append(nn.Sequential(*gpu_layers).to(device))
            current_layer += num
        
        self.embedding = model.model.embed_tokens.to(torch.device('cuda:0'))
        self.ln_f = model.model.norm.to(torch.device(f'cuda:{len(gpu_memory) - 1}'))
        self.head = model.lm_head.to(torch.device(f'cuda:{len(gpu_memory) - 1}'))

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        for i, layers in enumerate(self.gpu_layers):
            device = torch.device(f'cuda:{i}')
            x = x.to(device)
            for layer in layers:
                # The decoder layers are called with only the hidden states
                # (no attention_mask / position_ids are threaded through in this prototype)
                x = layer(x)
                if isinstance(x, tuple):
                    x = x[0]  # Decoder layers return tuples; keep only the hidden states
        x = self.ln_f(x.to(torch.device(f'cuda:{len(self.gpu_layers) - 1}')))
        logits = self.head(x)
        return logits

# Load configuration and tokenizer
model_name = "unsloth/Phi-3-mini-4k-instruct"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize the model architecture with random weights (no pretrained checkpoint is loaded)
model = MistralForCausalLM(config)

# Get GPU memory and split the model
gpu_memory = get_gpu_memory()
distributed_model = DistributedModel(model, gpu_memory)

# Inputs will be placed on the first GPU (the tokenizer itself stays on CPU)
device = torch.device('cuda:0')

# Run inference
input_text = "Your input text here"
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs['input_ids'].to(device)

with torch.no_grad():
    outputs = distributed_model(input_ids)

# Decode the per-position argmax (note: this is not autoregressive generation)
decoded_output = tokenizer.decode(outputs[0].argmax(dim=-1).tolist(), skip_special_tokens=True)
print(decoded_output)

I am wondering whether something like this might be implemented in the future, especially given how well vLLM scales up from a single GPU.

Report of performance regression

  • No response

Misc discussion on performance

  • No response

Your current environment (if you think it is necessary)

  • No response
