vllm [Bug]: Chunked prefill and non-chunked outputs differ for long prompts

Posted by ubbxdtey 6 months ago in Other


Describe the bug

Minimal reproducible example (the prompt is large, so it lives in a separate txt file; I am happy to share it if needed):

import gc
import os
import torch
from vllm import LLM, SamplingParams

from huggingface_hub import login
login(token=os.environ.get("HF_TOKEN"))

model = "meta-llama/Meta-Llama-3-8B-Instruct"

with open("/workspace/vllm/temp/paxos-paper.txt", "r") as f:
    prompt = f.read()
params = SamplingParams(max_tokens=100, temperature=0.5)

def generate(enable_chunked_prefill):
    llm = LLM(model=model, enable_chunked_prefill=enable_chunked_prefill)
    request_output = llm.generate(prompt, sampling_params=params)[0]
    del llm
    gc.collect()
    torch.cuda.empty_cache()

    out = request_output.outputs[0]
    print(f"\nPrompt length: {len(request_output.prompt_token_ids)} tokens")
    print(f"OUTPUT: ({len(out.token_ids)} tokens)")
    print(out.text, "\n")
    return out.text

nonchunked_text = generate(False)
chunked_text = generate(True)

print("\nNON-CHUNKED:")
print(nonchunked_text)
print("\nCHUNKED:")
print(chunked_text)
print("\nDifferent?", nonchunked_text != chunked_text)

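With a 3600-token input, meta-llama/Meta-Llama-3-8B-Instruct produces unexpected results. Edit: the bug appears when toggling chunked prefill on and off, whereas we should expect no difference. From what I have seen, it happens with sampling temperatures in the range [0.3, 0.9].
Sample output from running the example above: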

NON-CHUNKED:
  What are the main points of the paper?

Here is a summary of the paper:

The paper presents the Paxos algorithm for implementing a fault-tolerant distributed system. The algorithm is a consensus algorithm that ensures that a single value is chosen from a set of proposed values, even in the presence of failures and message loss.

The paper starts by defining the problem of choosing a value in a distributed system, and presents the safety requirements for a consensus algorithm. The algorithm is presented in three phases:

CHUNKED:
  What are the main points and what are the contributions of this paper?  What are the limitations and assumptions of this paper?  What are the implications of this paper for the field of distributed systems?

The paper presents a consensus algorithm for distributed systems, known as Paxos, which is designed to ensure that a single value is chosen from a set of proposed values. The algorithm is based on a state machine approach, where a set of processes (proposers, acceptors, and learners)

Different? True
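To see where the two completions above start to differ, rather than only whether they differ, a small helper can report the first mismatching position. This is a minimal sketch and not part of the original report: `first_divergence` is a hypothetical name, and the two strings are shortened copies of the first lines of the outputs shown above. The same function also works on lists of token IDs such as `out.token_ids`.

```python
def first_divergence(a, b):
    """Return the first index where sequences a and b differ, or None if they are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Shortened first lines of the two outputs above (illustrative only).
nonchunked = "  What are the main points of the paper?"
chunked = "  What are the main points and what are the contributions of this paper?"

idx = first_divergence(nonchunked, chunked)
print("first differing character index:", idx)
print("shared prefix:", repr(nonchunked[:idx]))
```

With sampling at temperature 0.5, a single differing token early in the completion is enough to send the rest of the text down a different path, so the position of the first mismatch is usually more informative than the overall text difference.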


Source: https://github.com/vllm-project/vllm/issues/5952

2 answers

wydwbb8l #1

If possible, please suggest how to proceed.

6 months ago

yacmzcpb #2

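Others may disagree, but I think this output difference can hardly be called a bug unless quality benchmarks differ substantially. Chunked prefill uses different kernels from the regular path, and this test uses neither temperature == 0, nor float32 dtype, nor greedy sampling. The chunked prefill output also looks reasonably sensible to me. Have you run a benchmark such as MMLU and seen a large quality difference?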
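The conditions named in this reply (temperature 0, float32, greedy sampling) can be turned into a stricter version of the comparison. The following is a minimal sketch under stated assumptions, not a confirmed fix: `deterministic_settings` and `generate_token_ids` are hypothetical helper names, while `dtype`, `enable_chunked_prefill`, `temperature`, and `max_tokens` are the vLLM arguments already used or mentioned in this thread. Loading an 8B model in float32 needs roughly 32 GB for the weights alone, so this may not fit on every GPU.

```python
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"


def deterministic_settings(enable_chunked_prefill: bool):
    """Return (LLM kwargs, SamplingParams kwargs) for a greedy float32 run."""
    llm_kwargs = {
        "model": MODEL,
        "dtype": "float32",  # full precision instead of a 16-bit dtype
        "enable_chunked_prefill": enable_chunked_prefill,
    }
    sampling_kwargs = {
        "max_tokens": 100,
        "temperature": 0.0,  # temperature 0 selects greedy decoding in vLLM
    }
    return llm_kwargs, sampling_kwargs


def generate_token_ids(prompt: str, enable_chunked_prefill: bool):
    """Run one generation and return the output token IDs (requires vLLM and a GPU)."""
    # Imported lazily so the settings helper can be inspected without vLLM installed.
    from vllm import LLM, SamplingParams

    llm_kwargs, sampling_kwargs = deterministic_settings(enable_chunked_prefill)
    llm = LLM(**llm_kwargs)
    result = llm.generate(prompt, sampling_params=SamplingParams(**sampling_kwargs))[0]
    return list(result.outputs[0].token_ids)
```

Calling `generate_token_ids(prompt, False)` and `generate_token_ids(prompt, True)` and comparing the two lists removes sampling randomness from the picture. Because the two paths use different kernels, small numerical differences can remain even under these settings, so a mismatch here is a stronger signal than one at temperature 0.5 but still not proof of a bug on its own.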

6 months ago
