Problem: while running tests to compare the output of different models, my computer freezes. I'm using tools such as promptfoo or langfuse (via Haystack or Langchain). In these tools you define a list of models, and the program then calls Ollama sequentially to load each of them. The machine is a Linux box running Ubuntu 22.04 with an RTX 3090.
After the program has loaded some of the larger models (fp16, command-r or gemma2:27B), the machine freezes when it tries to load the next one. I lose all access to it, SSH stops responding, and the only way to recover is a hard reboot with the physical power button.
Here is the output from a crash while loading mistral-nemo:12b (it can happen with other models too, so it's probably not model-specific):
Cache is disabled.
Providers are running in serial with user input.
Running 1 evaluations for provider ollama:chat:command-r:latest with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:command-r:latest "Eres un as" lista_ingr
Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:gemma2:27b-instruct-q5_K_M with concurrency=4...
[████████████████████████████████████████] 100% | ETA: 0s | 1/1 | ollama:chat:gemma2:27b-instruct-q5_K_M "Eres un as" lista_ingr
Ready to continue to the next provider? (Y/n)
Running 1 evaluations for provider ollama:chat:gemma2:2b-instruct-q8_0 with concurrency=4...
[░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0% | ETA: 0s | 0/1 | ""
Mistral-Nemo never finished loading and the computer stopped responding. The local console froze as well (I have Psensor showing live metrics and... they looked fairly normal, then froze).
Using backends other than Ollama did not cause similar crashes, but they are less convenient to use, so I'd like to debug this and find out what is bringing the OS down.
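To rule out the eval harness itself, here is a minimal reproduction sketch of my own (not promptfoo or langfuse code) that loads each model in turn through Ollama's /api/chat endpoint; the model list and prompt are just placeholders:

```python
# Minimal sketch: reproduce sequential model switching against Ollama,
# independent of promptfoo/langfuse. Model names and prompt are placeholders.
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"
MODELS = [
    "command-r:latest",
    "gemma2:27b-instruct-q5_K_M",
    "gemma2:2b-instruct-q8_0",
    "mistral-nemo:12b",
]

def chat(model: str, prompt: str) -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)["message"]["content"]

for model in MODELS:
    # Each call forces Ollama to unload the previous model and load the next,
    # which is the point where the machine freezes in my runs.
    print(f"--- {model} ---", flush=True)
    print(chat(model, "Say hello in one sentence.")[:200], flush=True)
```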
Checking the logs, I don't see anything related to the crash or to the next model being loaded:
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.90.07 Fri May 31 09:35:42 UTC 2024
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: f_logit_scale = 0.0e+00
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_ff = 14336
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_expert = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_expert_used = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: causal attn = 1
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: pooling type = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: rope type = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: rope scaling = linear
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: freq_base_train = 10000.0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: freq_scale_train = 1
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: n_ctx_orig_yarn = 32768
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: rope_finetuned = unknown
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_d_conv = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_d_inner = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_d_state = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: ssm_dt_rank = 0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model type = 7B
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model ftype = Q4_0
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model params = 7.24 B
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: model size = 3.83 GiB (4.54 BPW)
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: general.name = mayflowergmbh
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: BOS token = 1 '<s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: EOS token = 2 '</s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: UNK token = 0 '<unk>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: PAD token = 2 '</s>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: LF token = 13 '<0x0A>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: EOT token = 32001 '<|im_end|>'
ago 19 10:47:19 ananke ollama[2292]: llm_load_print_meta: max token length = 48
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ago 19 10:47:19 ananke ollama[2292]: ggml_cuda_init: found 1 CUDA devices:
ago 19 10:47:19 ananke ollama[2292]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
ago 19 10:47:20 ananke ollama[2292]: llm_load_tensors: ggml ctx size = 0.27 MiB
ago 19 10:47:20 ananke ollama[2292]: time=2024-08-19T10:47:20.104+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloading 32 repeating layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloading non-repeating layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: offloaded 33/33 layers to GPU
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: CPU buffer size = 70.32 MiB
ago 19 10:47:23 ananke ollama[2292]: llm_load_tensors: CUDA0 buffer size = 3847.56 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_ctx = 65536
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_batch = 512
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: n_ubatch = 512
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: flash_attn = 0
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: freq_base = 10000.0
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: freq_scale = 1
ago 19 10:47:23 ananke ollama[2292]: llama_kv_cache_init: CUDA0 KV buffer size = 8192.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: KV self size = 8192.00 MiB, K (f16): 4096.00 MiB, V (f16): 4096.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: CUDA_Host output buffer size = 0.55 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: CUDA0 compute buffer size = 4256.00 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: CUDA_Host compute buffer size = 136.01 MiB
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: graph nodes = 1030
ago 19 10:47:23 ananke ollama[2292]: llama_new_context_with_model: graph splits = 2
ago 19 10:47:23 ananke ollama[8085]: INFO [main] model loaded | tid="130282049744896" timestamp=1724057243
ago 19 10:47:23 ananke ollama[2292]: time=2024-08-19T10:47:23.622+02:00 level=INFO source=server.go:632 msg="llama runner started in 3.77 seconds"
ago 19 10:47:26 ananke ollama[2292]: [GIN] 2024/08/19 - 10:47:26 | 200 | 6.932276921s | 127.0.0.1 | POST "/api/chat"
ago 19 10:47:27 ananke ollama[2292]: time=2024-08-19T10:47:27.959+02:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-16d65981-a438-ac56-8bab-f1393824041b library=cuda total="23.7 GiB" available="6.9 GiB"
Any help with this would be greatly appreciated!
OS
Linux
GPU
Nvidia
CPU
Intel
Ollama version
0.3.6
3 answers
#1
A whole-system freeze sounds a lot like running out of RAM. Since you are using large models such as Command R and Mistral Nemo, they will spill over from the GPU into CPU RAM (your GPU model itself is fine). But since you also appear to be using a huge context size (64k), you may be exhausting your RAM, or getting very close to it. Did you monitor RAM usage on the affected system? Do you have swap enabled and set to a reasonable size (e.g. 8 GB), so the system can spill over instead of crashing?
I regularly use Open-WebUI to compare 3-10 models at once (sequentially), which causes a lot of reloading, and have never had any issues.
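To actually catch a RAM spill before the machine becomes unreachable, one option is to log memory and swap usage to disk once a second, flushing and fsyncing every line so the last reading survives a hard freeze. A minimal sketch using psutil (my addition, not part of the original answer; the output path is arbitrary):

```python
# Minimal sketch: log RAM and swap usage once per second to a file,
# flushing and fsyncing each line so the last reading survives a hard freeze.
# Requires: pip install psutil. The output path is arbitrary.
import os
import time

import psutil

LOG_PATH = "/var/tmp/mem_watch.log"

with open(LOG_PATH, "a", buffering=1) as f:
    while True:
        vm = psutil.virtual_memory()
        sw = psutil.swap_memory()
        line = (
            f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
            f"ram_used={vm.used / 2**30:.1f}GiB ({vm.percent}%) "
            f"swap_used={sw.used / 2**30:.1f}GiB ({sw.percent}%)\n"
        )
        f.write(line)
        f.flush()
        os.fsync(f.fileno())
        time.sleep(1)
```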
#2
The log looks incomplete; there should be lines like
time=2024-08-19T10:30:24.515Z level=INFO source=memory.go:309 msg="offload to cuda"
which would help investigate what is going on. Can you post the complete log? Based on
llm_load_tensors: offloaded 33/33 layers to GPU
it looks like mayflowergmbh was loaded entirely onto the GPU. mistral-nemo:12b-instruct-2407-q8_0 has 41 layers, which may or may not fit in the RTX 3090's 24 GB if other models are also resident. Hard to say without the complete log.
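One way to check whether other models are still resident between providers is Ollama's /api/ps endpoint, which lists the currently loaded models and how much of each sits in VRAM. A rough sketch; the response field names (size, size_vram, expires_at) are my reading of the API docs and may vary by version:

```python
# Rough sketch: query Ollama's /api/ps endpoint to see which models are still
# loaded and how much VRAM each occupies. Field names (size, size_vram,
# expires_at) are assumptions from the API docs and may differ by version.
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:11434/api/ps", timeout=10) as resp:
    data = json.load(resp)

for m in data.get("models", []):
    size = m.get("size", 0)
    size_vram = m.get("size_vram", 0)
    print(
        f"{m.get('name')}: total={size / 2**30:.1f}GiB, "
        f"vram={size_vram / 2**30:.1f}GiB, expires={m.get('expires_at')}"
    )
```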
#3
Thanks @MaxJa4 for the suggestion to monitor RAM. I always pick models whose weights fit in the card's VRAM, but once the context size is taken into account it may be offloading part of it to RAM. This machine has 64 GB of RAM and the texts I'm working with aren't very large, but with several models... I'll look into this.
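For reference, the log above shows n_ctx = 65536 producing an 8 GiB KV cache plus a ~4 GiB compute buffer on the GPU; if the 64k context turns out to be the culprit, one quick experiment is to request a smaller context per call. A minimal sketch against the same /api/chat endpoint as in the reproduction script (the value 8192 is just an example):

```python
# Sketch: request a smaller context window per call via options.num_ctx.
# At n_ctx=65536 the log shows an 8 GiB KV cache plus a ~4 GiB compute buffer;
# a smaller num_ctx shrinks both. The value 8192 is only an example.
import json
import urllib.request

payload = {
    "model": "mistral-nemo:12b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "stream": False,
    "options": {"num_ctx": 8192},
}
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=600) as resp:
    print(json.load(resp)["message"]["content"])
```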
Regarding the complete log, @rick-github, do you mean just the full log of the Ollama service? I've attached the one corresponding to the last run.
ollama_error_log.txt
The good news is that I think it was related to some kind of driver mismatch. I just updated all the packages on the machine (using the Lambda Stack repository to make sure everything stays consistent), removed all unused packages, and was then able to run more than 20 model tests without a crash. I still need to go through the logs to identify which combinations failed and which worked.