text-generation-inference: "Failed to buffer the request body: length limit exceeded" error when a base64-encoded image larger than 1 MB is provided in the prompt

2nc8po8w posted 6 months ago in: Other

System info

text-generation-launcher --env

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: 2d0a7173d4891e7cd5f9b77f8e0987b82a339e51
Docker label: sha-2d0a717
nvidia-smi:
Wed Apr 24 19:58:49 2024
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
   |  0%   23C    P8             16W /  350W |    8450MiB /  24576MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+
   |   1  NVIDIA GeForce RTX 3090        On  |   00000000:21:00.0 Off |                  N/A |
   |  0%   25C    P8             20W /  350W |    8418MiB /  24576MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+

Model info

{
  "model_id": "/opt/ml/checkpoint/llava-v1.6-mistral-7b-hf",
  "model_sha": null,
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 24576,
  "max_total_tokens": 32768,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 65536,
  "max_waiting_tokens": 20,
  "max_batch_size": null,
  "validation_workers": 2,
  "max_client_batch_size": 4,
  "version": "2.0.1",
  "sha": "2d0a7173d4891e7cd5f9b77f8e0987b82a339e51",
  "docker_label": "sha-2d0a717"
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Use an image larger than 1 MB, with IMAGE_PATH and API_ENDPOINT set appropriately:

from PIL import Image
import requests
import base64
from io import BytesIO

# load the image from disk
image = Image.open(IMAGE_PATH)

# convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="PNG")  # use the appropriate format (e.g. JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode("utf-8")

# embed the image as a data URI in the prompt
image_string = f"data:image/png;base64,{base64_image}"
query = "Describe the image?"
prompt = f"[INST] ![]({image_string})\n{query} [/INST]"

headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
}

payload = {"inputs": prompt}
response = requests.post(f"{API_ENDPOINT}/generate", headers=headers, json=payload)
try:
    print(response.json())
except ValueError:
    # the error response is plain text rather than JSON
    print(response.text)

This prints: Failed to buffer the request body: length limit exceeded
If the image used is smaller than 1 MB, generation works correctly.
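
As a workaround sketch (my addition, not part of the original report), one option is to re-encode the image client-side so the base64 payload stays below the suspected ~1 MB request-body limit. The helper name image_to_base64_under_limit, the max_bytes budget, and the quality steps below are illustrative assumptions, not anything TGI provides:

from io import BytesIO
import base64

from PIL import Image


def image_to_base64_under_limit(path, max_bytes=900_000):
    """Re-encode an image as JPEG, lowering quality until the base64
    string fits under max_bytes (an assumed budget below the limit)."""
    image = Image.open(path).convert("RGB")
    for quality in (95, 85, 75, 60, 45, 30):
        buffer = BytesIO()
        image.save(buffer, format="JPEG", quality=quality)
        encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
        if len(encoded) <= max_bytes:
            return f"data:image/jpeg;base64,{encoded}"
    raise ValueError("could not compress the image under the size budget")


# usage: drop-in replacement for image_string in the reproduction script
# image_string = image_to_base64_under_limit(IMAGE_PATH)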

Expected behavior

It should generate text for the image as long as it fits within the model's context. Judging from the error text, this appears to be related to the default body size in Axum, based on the similarity to tokio-rs/axum#1652.
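
For context (my own back-of-the-envelope check, not from the report): base64 inflates the raw bytes by roughly 4/3, so an image a bit under 1 MB on disk already produces a JSON body over 1 MiB, which would trip a default 1 MiB body limit. The snippet below simply measures the body that the reproduction script sends; the payload variable is assumed from that script:

import json

# `payload` is assumed to be the dict built in the reproduction script above
body = json.dumps(payload).encode("utf-8")
print(f"request body: {len(body)} bytes ({len(body) / (1024 * 1024):.2f} MiB)")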

lg40wkob 1#

Possibly related to #1777.

vlju58qv 2#

I've hit this as well, but with the idefics-9b-instruct model, which also fails with the length limit exceeded error. That model can handle images of varying dimensions, yet it still fails when the image is large (over 1 MB).

gr8qqesn 3#

This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

v7pvogib 4#

I will re-verify against the latest TGI version in the near future.

pjngdqdw 5#

I tried this again with the latest version, using the idefics-8b-chatty model instead of the llava model, and the problem persists.


jutyujz0 6#

This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

e1xvtsh3 7#

I tried to reproduce this on the latest TGI version (2.2), but ended up with a different error:

{"timestamp":"2024-07-25T17:50:30.156102Z","level":"ERROR","message":"Server error: 'Tensor' object has no attribute 'input_lengths'","target":"text_generation_client","filename":"router/client/src/lib.rs","line_number":46,"span":{"size":1,"name":"decode"},"spans":[{"batch_size":1,"name":"batch"},{"name":"decode"},{"size":1,"name":"decode"},{"size":1,"name":"decode"}]}
{"timestamp":"2024-07-25T17:50:30.149213Z","level":"ERROR","fields":{"message":"Method Decode encountered an error.\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 309, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 723, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 193, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 692, in wrapper\n return callback(**use_params)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 118, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 297, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n File \"/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py\", line 165, in invoke_intercept_method\n return await self.intercept(\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py\", line 21, in intercept\n return await response\n File \"/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 120, in _unary_interceptor\n raise error\n File \"/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 111, in _unary_interceptor\n return await behavior(request_or_iterator, context)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 183, in Decode\n generations, next_batch, timings = self.model.generate_token(batch)\n File \"/opt/conda/lib/python3.10/contextlib.py\", line 79, in inner\n return func(*args, **kwds)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py\", line 1376, in generate_token\n out, speculative_logits = self.forward(batch, adapter_data)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py\", line 351, in forward\n logits, speculative_logits = self.model.forward(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py\", line 824, in forward\n hidden_states = self.text_model.model(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in 
_wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 447, in forward\n hidden_states, residual = layer(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 372, in forward\n attn_output = self.self_attn(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 235, in forward\n attn_output = paged_attention(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/attention/cuda.py\", line 116, in paged_attention\n input_lengths = seqlen.input_lengths\nAttributeError: 'Tensor' object has no attribute 'input_lengths'"},"target":"text_generation_launcher"}
