Trying to run the ShieldGemma model.
The architecture is Gemma2ForCausalLM, which should already be supported. The config file specifies transformers version 4.42.4.
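For reference, the checkpoint's declared architecture and the transformers version it was saved with can be confirmed straight from its config (a quick check using the snapshot path that appears in the log below; the key names are standard Hugging Face config fields):

import json

cfg_path = "/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)
print(cfg["architectures"])             # expect ["Gemma2ForCausalLM"]
print(cfg.get("transformers_version"))  # expect 4.42.4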
I have the following installed:
pip list | grep "vllm\|flash"
flash-attn 2.0.4
flashinfer 0.1.3+cu124torch2.4
vllm 0.5.3.post1
vllm-flash-attn 2.5.9.post1
I also have Transformers 4.43.3 installed.
On inspecting the config file, I found that it specifies hidden_activation rather than hidden_act. After changing this manually in config.json, I got an error telling me to use the FlashInfer backend, so I set VLLM_ATTENTION_BACKEND=FLASHINFER.
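A minimal sketch of that edit, assuming the fix is renaming the hidden_activation key to hidden_act in place (back up config.json first):

import json

cfg_path = "/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)
# Rename the key to the name the loader expects.
if "hidden_activation" in cfg:
    cfg["hidden_act"] = cfg.pop("hidden_activation")
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)

Note that VLLM_ATTENTION_BACKEND has to be present in the process environment before the engine is created, e.g. export VLLM_ATTENTION_BACKEND=FLASHINFER in the shell, or os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER" early in the Python entry point.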
After that, the following error occurred:
INFO 08-02 17:46:35 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', speculative_config=None, tokenizer='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-02 17:46:36 selector.py:80] Using Flashinfer backend.
INFO 08-02 17:46:36 model_runner.py:680] Starting to load model /modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d...
INFO 08-02 17:46:36 selector.py:80] Using Flashinfer backend.
2024-08-02 17:46:37 | ERROR | stderr | Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:01<00:01, 1.41s/it]
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.31it/s]
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.17it/s]
INFO 08-02 17:46:38 model_runner.py:692] Loading model weights took 4.9975 GB
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: Traceback (most recent call last):
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/project/vllm_worker.py", line 236, in <module>
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: engine = cls(
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: self.engine = self._init_engine(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: return engine_class(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: self._initialize_kv_caches()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: self.model_executor.determine_num_available_blocks())
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: return self.driver_worker.determine_num_available_blocks()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: self.model_runner.profile_run()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1272, in execute_model
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: BatchDecodeWithPagedKVCacheWrapper(
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: TypeError: 'NoneType' object is not callable
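The failing name, BatchDecodeWithPagedKVCacheWrapper, is something vLLM pulls in from FlashInfer; the 'NoneType' object is not callable error indicates that import failed and the symbol was left as None. A quick diagnostic (a hypothetical check, not part of the original post) is to attempt the import directly in the same environment:

# If this raises ImportError, vLLM falls back to None for the
# wrapper and later crashes exactly as in the traceback above.
try:
    from flashinfer import BatchDecodeWithPagedKVCacheWrapper
    print("FlashInfer import OK")
except ImportError as e:
    print("FlashInfer import failed:", e)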
1 Answer
The error means you do not have FlashInfer installed. Please follow the steps shared here.
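For what it's worth, FlashInfer's documented install goes through a wheel index matched to specific CUDA and torch versions (the URL pattern below follows FlashInfer's docs; adjust the cu/torch tags to your environment):

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

The pip list output above shows flashinfer 0.1.3+cu124torch2.4, i.e. a wheel built for torch 2.4. If the installed vLLM runs against a different torch (vllm 0.5.3.post1 pins torch 2.3.x, as far as I know), the FlashInfer import can fail at runtime even though pip reports the package as installed, which would produce exactly the None wrapper seen in the traceback.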