[Bug]: Sending a request with response_format json twice breaks vLLM

0s7z1bwu · posted 4 months ago · in: Other

Current environment

Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.133+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             8
On-line CPU(s) list:                0-7
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family:                         6
Model:                              85
Thread(s) per core:                 2
Core(s) per socket:                 4
Socket(s):                          1
Stepping:                           7
BogoMIPS:                           4400.44
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          128 KiB (4 instances)
L1i cache:                          128 KiB (4 instances)
L2 cache:                           4 MiB (4 instances)
L3 cache:                           38.5 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCm Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-7	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

vLLM enters a broken state after a particular request with response_format = json is sent, and afterwards only returns garbage. On the first such request vLLM manages to return a reasonably sensible response, but as soon as the same request is repeated it only returns \n\t\t\t\t..., where \t repeats until max_tokens is reached.

Steps to reproduce:

  1. Deploy vLLM v0.4.0.post1 with the OpenAI-compatible API endpoint and Mistral v0.2 Instruct from HF. Model ID: mistralai/Mistral-7B-Instruct-v0.2. The following configuration was used:
gpuMemoryUtilization: "0.90"
maxModelLen: 16752

This example runs on a single L4 GPU.

  2. Send the following request twice (see the sketch after these steps for an equivalent setup and request body):
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d @request-body-sanitized.json
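
For context, here is a minimal sketch of an equivalent setup and request. The server flags simply mirror the configuration values above; the request body is a hypothetical stand-in for request-body-sanitized.json (attached in the first answer but not reproduced here), so the prompt, max_tokens value, and served model name are assumptions.

# Sketch: launch the OpenAI-compatible server with the configuration above
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16752 \
    --port 8080

# Sketch: send the same guided-JSON completion request twice
# (hypothetical stand-in body, not the actual attachment)
for i in 1 2; do
  curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "mistralai/Mistral-7B-Instruct-v0.2",
          "prompt": "Summarize what a large language model is. Respond as JSON with isvalid and summary fields.",
          "max_tokens": 512,
          "response_format": {"type": "json_object"}
        }'
  echo
done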

Current result:

  1. The result of the first request will look like this:
{"id":"cmpl-e7eef8ba397b4dab8c396a0611773178","object":"text_completion","created":1713118051,"model":"mistral-7b-instruct-v0.2","choices":[{"index":0,"text":"\n{\n   \t\"isvalid\"\t:\ttrue,\n   \t\"summary\"\t:\t\"A large l
anguage model (LLM) is a type of AI model that can generate and process natural language. It learns from text do
cuments and can be used for tasks like text generation and classification. LLMs can predict the next word or tok
en in a text input.\"\n}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n

\n repeats until the maximum token count is reached.

  2. The second request will return a response containing only \n\t\t\t\t, where \t repeats until the maximum token count is reached.
    Occasionally vLLM also gets into an error state in which all requests return errors, but I cannot reliably reproduce that state. When it happens, the following error can be seen:
lark.exceptions.UnexpectedToken: Unexpected token Token('LBRACE', '{') at line 11, column 2.
Expected one of:
        * RBRACE
        * UNESCAPED_STRING
Previous tokens: [Token('LBRACE', '{')]
lark.exceptions.UnexpectedCharacters: No terminal matches '{' in the current parser context, at line 11 col 2

{{{{{{{{{{

This issue was originally reported against Lingo as substratusai/lingo#96, but it appears to be caused by vLLM itself.

Expected result

vLLM should not enter a broken state just because response_format = json was used, leaving subsequent requests unable to return any usable result.

mo49yndu1#

request-body-sanitized.json
Attaching the request body that was used to reproduce the issue.

wa7juj8i2#

It looks like there are some issues with how the outlines finite-state machine (FSM) is copied and initialized. For this kind of case, lm-format-enforcer might work better.
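
If switching the guided-decoding backend is an option, newer vLLM releases expose a --guided-decoding-backend server flag that selects lm-format-enforcer instead of outlines. A minimal sketch, assuming an installed version that already supports this flag (it is not present in v0.4.0; check the api_server --help output):

python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --guided-decoding-backend lm-format-enforcer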

6jjcrrmo3#

There seems to be a bug when setting:

response_format = { 
   "type": "json_object" 
},
  • Example client:
from openai import OpenAI
import os

def prompt_json_completion(messages):
    base_url = os.getenv("BASE_URL", "http://localhost:8000/v1")
    api_key = os.getenv("API_KEY", "EMPTY")
    max_tokens = int(os.getenv("MAX_TOKENS", 100))  # env vars are strings, so cast to int

    client = OpenAI(api_key=api_key, base_url=base_url)
    completion = client.chat.completions.create(
        model=client.models.list().data[0].id,  # use the first model served by the endpoint
        # Uncommenting response_format reproduces the issue described below:
        # response_format={
        #     "type": "json_object"
        # },
        messages=messages,
        max_tokens=max_tokens,
    )
    # print(completion)
    print(completion.choices[0].message.content)

if __name__ == "__main__":
    user_prompt = "Generate example JSON data of a student in an SIS"
    messages = [
        {"role": "user", "content": user_prompt}
    ]
    prompt_json_completion(messages=messages)

If I uncomment response_format, I get back nothing but whitespace characters.

7jmck4yq4#

I'm running into the same bug with json_object. Has anyone hit this in an earlier version?

6yoyoihd5#

当将"response_format"设置为{"type": "json_object"}时,文本生成在达到最大模型长度时停止。当将"response_format"设置为{"type": "text"}时,一切正常。
模型:Mistral-7B-Instruct-v0.2-Function-Calling
vllm: 0.4.1

mftmpeh86#

outlines has made some improvements to its JSON output since then, and vLLM previously pinned it to outlines==0.0.34. These problems may already be fixed on nightly:
vllm/requirements-common.txt, line 20 at abe855d:
outlines >= 0.0.43 # Requires torch >= 2.1.0
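
A quick way to check which outlines version an environment actually carries, and to pull in the newer pin, is sketched below (assumption: the relevant fix has landed in a released vLLM rather than only in nightly builds):

pip show outlines              # print the currently installed outlines version
pip install --upgrade vllm     # picks up the outlines pin from requirements-common.txt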

mf98qq947#

I believe PR #4109, which has been merged into main, fixes this issue. (@br3no)

xdnvmnnf8#

I recently tried v0.5.0.post1 with vLLM + outlines, and when {"type": "json_object"} is specified I still see \t\n generated repeatedly until max_length.

u2nhd7ah9#

Yes, I'm running into the same issue.
vllm 0.4.2, vllm-nccl-cu12 2.18.1.0.4.0
Curious about the root cause and about any workarounds; any suggestions would be much appreciated.
