Model training error: paddle.fluid.core_noavx has no attribute 'c_broadcast'

rta7y2nd · posted 5 months ago in: Other

Please ask your question

Environment: an ARMv8 CPU machine (no GPU), running in a container based on an Ubuntu 18.04 image, with Python 3.7.13. A virtual environment was created inside the container and the pip dependencies were installed as required; the installed Paddle version is 2.3.

Symptom: compilation completes without errors, installation succeeds, and single-process training works, but the distributed training command used with PaddleNLP fails.

Distributed training command: python -m paddle.distributed.launch --nproc_per_node=2 --backend='gloo' xxxx.py

Error message: paddle.fluid.core_noavx has no attribute 'c_broadcast'
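The missing attribute suggests the compiled core module does not expose the distributed collective ops (these are only compiled in when distributed support is enabled at build time). A minimal preflight check before launching distributed training could look like the sketch below; the op names are taken from the error message rather than a documented API, and the check is demonstrated on a stand-in object (with Paddle installed you would pass `paddle.fluid.core` instead):

```python
from types import SimpleNamespace

def missing_collective_ops(core_module, ops=("c_broadcast", "c_allreduce_sum")):
    """Return the collective ops that the compiled core module does not expose."""
    return [op for op in ops if not hasattr(core_module, op)]

# Stand-in for a core module built without distributed support:
fake_core = SimpleNamespace(c_broadcast=object())
print(missing_collective_ops(fake_core))  # ['c_allreduce_sum']
```

If the list is non-empty for the real `paddle.fluid.core` (or `core_noavx`), the build most likely lacked distributed support, which matches the error seen here.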

The Paddle build commands used were as follows:
git clone https://github.com/PaddlePaddle/Paddle.git

cd Paddle

git checkout release/2.3

mkdir build && cd build

ulimit -n 4096

export PADDLE_VERSION=2.3.0

cmake .. -DPY_VERSION=3.7.13 -DPYTHON_EXECUTABLE=$(which python3) -DWITH_ARM=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DON_INFER=ON -DWITH_XBYAK=OFF -DPYTHON_INCLUDE_DIR=$(python3 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") -DPYTHON_LIBRARY=$(python3 -c "import distutils.sysconfig as sysconfig; print(sysconfig.get_config_var('LIBDIR'))") -DWITH_GLOO=ON

make TARGET=ARMV8 -j$(nproc)
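After configuring, the feature flags a build actually picked up can be confirmed from build/CMakeCache.txt, where entries have the form NAME:TYPE=VALUE. A small sketch of such a check; the flag names mirror the cmake options used above:

```python
def read_cmake_flags(cache_text, flags=("WITH_GLOO", "WITH_DISTRIBUTE", "WITH_ARM")):
    """Extract selected feature flags from the text of a CMakeCache.txt file."""
    values = {}
    for line in cache_text.splitlines():
        for flag in flags:
            # Cache entries look like "WITH_GLOO:BOOL=ON"
            if line.startswith(flag + ":"):
                values[flag] = line.split("=", 1)[1]
    return values

sample = "WITH_GLOO:BOOL=ON\nWITH_ARM:BOOL=ON\n"
print(read_cmake_flags(sample))  # {'WITH_GLOO': 'ON', 'WITH_ARM': 'ON'}
```

Running this over the real cache (e.g. `read_cmake_flags(open("build/CMakeCache.txt").read())`) would show whether WITH_DISTRIBUTE was actually ON for the build in question.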

4ioopgfo 1#

Hi! We've received your issue; please be patient while we respond. We will arrange for technicians to answer your questions as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also check the official API docs, FAQ, historical GitHub Issues, and the AI community to find an answer. Have a nice day!

ctrmrzij 2#

Hi, is it this task? Does the error still occur with a more recent Paddle version?

slwdgvem 3#

Hi, is it this task? Does the error still occur with a more recent Paddle version?

Yes, it is this task: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/information_extraction/text
The 2.4 build of Paddle is still compiling; what I am using now is the 2.3 CPU build compiled for ARMv8. The training launch command is: python3 -m paddle.distributed.launch --nproc_per_node=8 --backend='gloo' finetune.py

weylhg0b 4#

Hi, is it this task? Does the error still occur with a more recent Paddle version?

Yes, it is this task: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/information_extraction/text . The 2.4 build of Paddle is still compiling; what I am using now is the 2.3 CPU build compiled for ARMv8. The training launch command is: python3 -m paddle.distributed.launch --nproc_per_node=8 --backend='gloo' finetune.py

When compiling version 2.3, -DWITH_DISTRIBUTE=ON was added as well; the full cmake command was:
cmake .. -DPY_VERSION=3.7.13 -DPYTHON_EXECUTABLE=$(which python3) -DWITH_ARM=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DON_INFER=ON -DWITH_XBYAK=OFF -DPYTHON_INCLUDE_DIR=$(python3 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") -DPYTHON_LIBRARY=$(python3 -c "import distutils.sysconfig as sysconfig; print(sysconfig.get_config_var('LIBDIR'))") -DWITH_GLOO=ON -DWITH_DISTRIBUTE=ON

It now fails at runtime; the error is:
ERROR 2023-02-27 16:07:13,704 launch_utils.py:642] ABORT!!! Out of all 8 trainers, the trainer process with rank=[1, 6, 7] was aborted. Please check its log.

3xiyfsfu 5#

@Macxy2018 Please check the log files of the workers with ids 1, 6, and 7 to see the specific cause of the error.

5hcedyr0 6#

There is no error message overall; workers 6 and 7 are the same, with no error output. Worker 1's log is as follows, after which it simply stopped:
/venv/paddle/lib/python3.7/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
[2023-02-27 10:48:44,788] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2023-02-27 10:48:44,789] [ INFO] - The default value for the training argument --report_to will change in v5 (from all installed integrations to none). In v5, you will need to use --report_to all to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2023-02-27 10:48:44,789] [ INFO] - ============================================================
[2023-02-27 10:48:44,789] [ INFO] - Model Configuration Arguments
[2023-02-27 10:48:44,790] [ INFO] - paddle commit id :a5875319fe3bdd359895f1f6a11faf21df886f88
[2023-02-27 10:48:44,790] [ INFO] - export_model_dir :./checkpoint_base_1/model_best
[2023-02-27 10:48:44,790] [ INFO] - model_name_or_path :uie-base
[2023-02-27 10:48:44,790] [ INFO] - multilingual :False
[2023-02-27 10:48:44,790] [ INFO] -
[2023-02-27 10:48:44,790] [ INFO] - ============================================================
[2023-02-27 10:48:44,790] [ INFO] - Data Configuration Arguments
[2023-02-27 10:48:44,791] [ INFO] - paddle commit id :a5875319fe3bdd359895f1f6a11faf21df886f88
[2023-02-27 10:48:44,791] [ INFO] - dev_path :data/dev.txt
[2023-02-27 10:48:44,791] [ INFO] - max_seq_length :512
[2023-02-27 10:48:44,791] [ INFO] - train_path :data/train.txt
[2023-02-27 10:48:44,791] [ INFO] -
[2023-02-27 10:48:46,324] [ WARNING] - Process rank: 1, device: cpu, world_size: 8, distributed training: True, 16-bits training: False
[2023-02-27 10:48:46,324] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-base'.
[2023-02-27 10:48:46,325] [ INFO] - Already cached /root/.paddlenlp/models/uie-base/ernie_3.0_base_zh_vocab.txt
[2023-02-27 10:48:46,360] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/uie-base/tokenizer_config.json
[2023-02-27 10:48:46,360] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/uie-base/special_tokens_map.json
[2023-02-27 10:48:46,362] [ INFO] - Model config ErnieConfig {
"attention_probs_dropout_prob": 0.1,
"enable_recompute": false,
"fuse": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 2048,
"model_type": "ernie",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"paddlenlp_version": null,
"pool_act": "tanh",
"task_id": 0,
"task_type_vocab_size": 3,
"type_vocab_size": 4,
"use_task_id": true,
"vocab_size": 40000
}
[2023-02-27 10:48:57,758] [ INFO] - All model checkpoint weights were used when initializing UIE.
[2023-02-27 10:48:57,759] [ INFO] - All the weights of UIE were initialized from the model checkpoint at uie-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-02-27 10:48:57,818] [ INFO] - ============================================================
[2023-02-27 10:48:57,819] [ INFO] - Training Configuration Arguments
[2023-02-27 10:48:57,819] [ INFO] - paddle commit id :a5875319fe3bdd359895f1f6a11faf21df886f88
[2023-02-27 10:48:57,819] [ INFO] - _no_sync_in_gradient_accumulation:True
[2023-02-27 10:48:57,819] [ INFO] - activation_quantize_type :None
[2023-02-27 10:48:57,820] [ INFO] - adam_beta1 :0.9
[2023-02-27 10:48:57,820] [ INFO] - adam_beta2 :0.999
[2023-02-27 10:48:57,820] [ INFO] - adam_epsilon :1e-08
[2023-02-27 10:48:57,820] [ INFO] - algo_list :None
[2023-02-27 10:48:57,820] [ INFO] - batch_num_list :None
[2023-02-27 10:48:57,820] [ INFO] - batch_size_list :None
[2023-02-27 10:48:57,820] [ INFO] - bf16 :False
[2023-02-27 10:48:57,821] [ INFO] - bf16_full_eval :False
[2023-02-27 10:48:57,821] [ INFO] - bias_correction :False
[2023-02-27 10:48:57,821] [ INFO] - current_device :cpu
[2023-02-27 10:48:57,821] [ INFO] - dataloader_drop_last :False
[2023-02-27 10:48:57,821] [ INFO] - dataloader_num_workers :0
[2023-02-27 10:48:57,821] [ INFO] - device :cpu
[2023-02-27 10:48:57,821] [ INFO] - disable_tqdm :True
[2023-02-27 10:48:57,822] [ INFO] - do_compress :False
[2023-02-27 10:48:57,822] [ INFO] - do_eval :True
[2023-02-27 10:48:57,822] [ INFO] - do_export :True
[2023-02-27 10:48:57,822] [ INFO] - do_predict :False
[2023-02-27 10:48:57,822] [ INFO] - do_train :True
[2023-02-27 10:48:57,822] [ INFO] - eval_batch_size :8
[2023-02-27 10:48:57,822] [ INFO] - eval_steps :100
[2023-02-27 10:48:57,822] [ INFO] - evaluation_strategy :IntervalStrategy.STEPS
[2023-02-27 10:48:57,823] [ INFO] - flatten_param_grads :False
[2023-02-27 10:48:57,823] [ INFO] - fp16 :False
[2023-02-27 10:48:57,823] [ INFO] - fp16_full_eval :False
[2023-02-27 10:48:57,823] [ INFO] - fp16_opt_level :O1
[2023-02-27 10:48:57,823] [ INFO] - gradient_accumulation_steps :1
[2023-02-27 10:48:57,823] [ INFO] - greater_is_better :True
[2023-02-27 10:48:57,823] [ INFO] - ignore_data_skip :False
[2023-02-27 10:48:57,824] [ INFO] - input_dtype :int64
[2023-02-27 10:48:57,824] [ INFO] - input_infer_model_path :None
[2023-02-27 10:48:57,824] [ INFO] - label_names :['start_positions', 'end_positions']
[2023-02-27 10:48:57,824] [ INFO] - learning_rate :1e-05
[2023-02-27 10:48:57,824] [ INFO] - load_best_model_at_end :True
[2023-02-27 10:48:57,824] [ INFO] - local_process_index :1
[2023-02-27 10:48:57,824] [ INFO] - local_rank :1
[2023-02-27 10:48:57,825] [ INFO] - log_level :-1
[2023-02-27 10:48:57,825] [ INFO] - log_level_replica :-1
[2023-02-27 10:48:57,825] [ INFO] - log_on_each_node :True
[2023-02-27 10:48:57,825] [ INFO] - logging_dir :./checkpoint_base_1/model_best/runs/Feb27_10-48-44_b11d0c49d963
[2023-02-27 10:48:57,825] [ INFO] - logging_first_step :False
[2023-02-27 10:48:57,825] [ INFO] - logging_steps :10
[2023-02-27 10:48:57,825] [ INFO] - logging_strategy :IntervalStrategy.STEPS
[2023-02-27 10:48:57,825] [ INFO] - lr_scheduler_type :SchedulerType.LINEAR
[2023-02-27 10:48:57,826] [ INFO] - max_grad_norm :1.0
[2023-02-27 10:48:57,826] [ INFO] - max_steps :-1
[2023-02-27 10:48:57,826] [ INFO] - metric_for_best_model :eval_f1
[2023-02-27 10:48:57,826] [ INFO] - minimum_eval_times :None
[2023-02-27 10:48:57,826] [ INFO] - moving_rate :0.9
[2023-02-27 10:48:57,826] [ INFO] - no_cuda :False
[2023-02-27 10:48:57,827] [ INFO] - num_train_epochs :100.0
[2023-02-27 10:48:57,827] [ INFO] - onnx_format :True
[2023-02-27 10:48:57,827] [ INFO] - optim :OptimizerNames.ADAMW
[2023-02-27 10:48:57,827] [ INFO] - output_dir :./checkpoint_base_1/model_best
[2023-02-27 10:48:57,827] [ INFO] - overwrite_output_dir :True
[2023-02-27 10:48:57,827] [ INFO] - past_index :-1
[2023-02-27 10:48:57,827] [ INFO] - per_device_eval_batch_size :8
[2023-02-27 10:48:57,828] [ INFO] - per_device_train_batch_size :8
[2023-02-27 10:48:57,828] [ INFO] - prediction_loss_only :False
[2023-02-27 10:48:57,828] [ INFO] - process_index :1
[2023-02-27 10:48:57,828] [ INFO] - prune_embeddings :False
[2023-02-27 10:48:57,828] [ INFO] - recompute :False
[2023-02-27 10:48:57,828] [ INFO] - remove_unused_columns :True
[2023-02-27 10:48:57,828] [ INFO] - report_to :['visualdl']
[2023-02-27 10:48:57,829] [ INFO] - resume_from_checkpoint :None
[2023-02-27 10:48:57,829] [ INFO] - round_type :round
[2023-02-27 10:48:57,829] [ INFO] - run_name :./checkpoint_base_1/model_best
[2023-02-27 10:48:57,829] [ INFO] - save_on_each_node :False
[2023-02-27 10:48:57,829] [ INFO] - save_steps :100
[2023-02-27 10:48:57,829] [ INFO] - save_strategy :IntervalStrategy.STEPS
[2023-02-27 10:48:57,829] [ INFO] - save_total_limit :None
[2023-02-27 10:48:57,830] [ INFO] - scale_loss :32768
[2023-02-27 10:48:57,830] [ INFO] - seed :1000
[2023-02-27 10:48:57,830] [ INFO] - sharding :[]
[2023-02-27 10:48:57,830] [ INFO] - sharding_degree :-1
[2023-02-27 10:48:57,830] [ INFO] - should_log :False
[2023-02-27 10:48:57,830] [ INFO] - should_save :False
[2023-02-27 10:48:57,830] [ INFO] - skip_memory_metrics :True
[2023-02-27 10:48:57,831] [ INFO] - strategy :dynabert+ptq
[2023-02-27 10:48:57,831] [ INFO] - train_batch_size :8
[2023-02-27 10:48:57,831] [ INFO] - use_pact :True
[2023-02-27 10:48:57,831] [ INFO] - warmup_ratio :0.1
[2023-02-27 10:48:57,831] [ INFO] - warmup_steps :0
[2023-02-27 10:48:57,831] [ INFO] - weight_decay :0.0
[2023-02-27 10:48:57,831] [ INFO] - weight_quantize_type :channel_wise_abs_max
[2023-02-27 10:48:57,832] [ INFO] - width_mult_list :None
[2023-02-27 10:48:57,832] [ INFO] - world_size :8
[2023-02-27 10:48:57,832] [ INFO] -
[2023-02-27 10:49:00,390] [ INFO] - ***** Running training *****
[2023-02-27 10:49:00,390] [ INFO] - Num examples = 570
[2023-02-27 10:49:00,390] [ INFO] - Num Epochs = 100
[2023-02-27 10:49:00,390] [ INFO] - Instantaneous batch size per device = 8
[2023-02-27 10:49:00,390] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 64
[2023-02-27 10:49:00,390] [ INFO] - Gradient Accumulation steps = 1
[2023-02-27 10:49:00,391] [ INFO] - Total optimization steps = 900.0
[2023-02-27 10:49:00,391] [ INFO] - Total num train samples = 57000.0
[2023-02-27 10:49:00,395] [ INFO] - Number of trainable parameters = 117946370

wfsdck30 7#

The information here does not reveal the cause of the problem yet. Does the error still occur with a 2.4 build?

6tqwzwtp 8#

The information here does not reveal the cause of the problem yet. Does the error still occur with a 2.4 build?

After building version 2.4, importing paddle reports this error:
(paddle) root@72ce093614d5:~/Setups/PaddleNLP/applications/information_extraction/text# python3
Python 3.7.13 (default, Feb 24 2023, 16:21:25)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle
Error: Can not import paddle core while this file exists: /venv/paddle/lib/python3.7/site-packages/paddle/fluid/libpaddle.so
Traceback (most recent call last):
File "/venv/paddle/lib/python3.7/site-packages/paddle/fluid/core.py", line 274, in <module>
from . import libpaddle
ImportError: /venv/paddle/lib/python3.7/site-packages/paddle/fluid/libpaddle.so: undefined symbol: shm_unlink

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/venv/paddle/lib/python3.7/site-packages/paddle/__init__.py", line 25, in <module>
from .framework import monkey_patch_variable
File "/venv/paddle/lib/python3.7/site-packages/paddle/framework/__init__.py", line 17, in <module>
from . import random # noqa: F401
File "/venv/paddle/lib/python3.7/site-packages/paddle/framework/random.py", line 16, in <module>
import paddle.fluid as fluid
File "/venv/paddle/lib/python3.7/site-packages/paddle/fluid/__init__.py", line 36, in <module>
from . import framework
File "/venv/paddle/lib/python3.7/site-packages/paddle/fluid/framework.py", line 37, in <module>
from . import core
File "/venv/paddle/lib/python3.7/site-packages/paddle/fluid/core.py", line 333, in <module>
if not avx_supported() and libpaddle.is_compiled_with_avx():
NameError: name 'libpaddle' is not defined
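An `undefined symbol: shm_unlink` error typically means libpaddle.so was not linked against the library that provides the symbol: on glibc older than 2.34, shm_unlink lives in librt, while newer glibc moved it into libc itself. A hedged sketch (assuming a glibc-based Linux host; library names are looked up dynamically) to see where the symbol resolves, which could help decide whether the Paddle link line needs librt added (e.g. -lrt):

```python
import ctypes
import ctypes.util

def shm_unlink_provider():
    """Return the first library ('rt' or 'c') exposing shm_unlink, or None."""
    for name in ("rt", "c"):  # librt on older glibc, libc on glibc >= 2.34
        path = ctypes.util.find_library(name)
        if path is None:
            continue
        try:
            # Attribute lookup on a CDLL raises AttributeError if the
            # symbol cannot be resolved through this library.
            getattr(ctypes.CDLL(path), "shm_unlink")
            return name
        except (AttributeError, OSError):
            continue
    return None

print(shm_unlink_provider())
```

If this returns "rt" on the build machine, the missing `-lrt` at link time is a plausible explanation for the 2.4 import failure; this is an assumption about the build, not a confirmed fix.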
