Describe the Bug
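I am training ResNet50 on ImageNet2012 with PaddleClas and found that the loss does not decrease.
To verify the kernels, I also ran ResNet18:
ResNet50 and ResNet18 use the same configuration, so I would like to ask what is most likely going wrong. Looking forward to your reply, thanks!
The configuration is as follows:
Batch size is 256 for both networks, and both run on a single card.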
Arch:
  name: ResNet50
  class_num: 1000
  input_image_channel: *image_channel
  data_format: "NHWC"
Loss:
  Train:
    - CELoss:
        weight: 1.0
  Eval:
    - CELoss:
        weight: 1.0
Optimizer:
  name: SGD
  lr:
    name: Piecewise
    learning_rate: 0.1
    decay_epochs: [30, 60, 90]
    values: [0.1, 0.01, 0.001, 0.0001]
  regularizer:
    name: 'L2'
    coeff: 0.0001
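For reference, the Piecewise schedule in this config should keep the learning rate at 0.1 for epochs 0-29 and divide it by 10 at epochs 30, 60, and 90. The sketch below reproduces that lookup in plain Python. `piecewise_lr` is a hypothetical helper written for illustration, not a PaddleClas function, and it assumes the boundaries are expressed in 0-based epochs (the framework may convert them to iteration steps internally).

```python
from bisect import bisect_right

# Values copied from the Optimizer.lr section of the config above.
DECAY_EPOCHS = [30, 60, 90]
VALUES = [0.1, 0.01, 0.001, 0.0001]


def piecewise_lr(epoch, boundaries=DECAY_EPOCHS, values=VALUES):
    """Return the learning rate for a 0-based epoch (hypothetical helper)."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have one more entry than boundaries")
    # bisect_right counts how many boundaries are <= epoch,
    # which is the index of the interval the epoch falls into.
    return values[bisect_right(boundaries, epoch)]


if __name__ == "__main__":
    for epoch in (0, 29, 30, 59, 60, 89, 90, 119):
        print(epoch, piecewise_lr(epoch))
```

Since the rate stays at 0.1 for the first 30 epochs, a loss that never moves early in training is probably not explained by the decay schedule itself.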
Additional Supplementary Information
No response
7 answers
o2gm4chl1#
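Please describe how you ran the training, and provide the Paddle, CUDA, and cuDNN versions along with any other relevant details. It would also help if you uploaded the training log.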
olqngx592#
How I ran it: python ./tools/train.py -c /home/compat/chaofanl/PaddleClas/ResNet50-new.yaml
paddlepaddle: f55b387df0f473574f82c83da0c4c821829f35a7 (Date: Tue Apr 25 21:02:37 2023)
paddleclas: 8ed2060033bfccef58c6e07a96ab4181f72fa7c5 (Date: Fri Jul 7 07:45:52 2023)
I am using an Intel GPU.
train.log
bcs8qyzn3#
The YAML file is as follows:
Global:
  checkpoints: null
  pretrained_model: null
  output_dir: ./output-new/
  device: intel_gpu
  save_interval: 1
  eval_during_train: False
  eval_interval: 1
  epochs: 120
  print_batch_step: 1
  use_visualdl: True
  image_channel: &image_channel 3
  image_shape: [*image_channel, 224, 224]
  save_inference_dir: ./inference
  to_static: False
Arch:
  name: ResNet50
  class_num: 1000
  input_image_channel: *image_channel
  data_format: "NHWC"
Loss:
  Train:
    - CELoss:
        weight: 1.0
  Eval:
    - CELoss:
        weight: 1.0
Optimizer:
  name: SGD
  lr:
    name: Piecewise
    learning_rate: 0.1
    decay_epochs: [30, 60, 90]
    values: [0.1, 0.01, 0.001, 0.0001]
  regularizer:
    name: 'L2'
    coeff: 0.0001
DataLoader:
  Train:
    dataset:
      name: ImageNetDataset
      image_root: /home_notmod/dataset_broad/dataset/imagenet/img_raw
      cls_label_path: /home_notmod/dataset_broad/dataset/imagenet/img_raw/train_list.txt
      transform_ops:
        to_rgb: True
        channel_first: False
        size: 224
        flip_code: 1
        scale: 1.0/255.0
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]
        order: ''
        channel_num: *image_channel
  Eval:
    dataset:
      name: ImageNetDataset
      image_root: /home_notmod/dataset_broad/dataset/imagenet/img_raw
      cls_label_path: /home_notmod/dataset_broad/dataset/imagenet/img_raw/val_list.txt
      transform_ops:
        to_rgb: True
        channel_first: False
        resize_short: 256
        size: 224
        scale: 1.0/255.0
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]
        order: ''
        channel_num: *image_channel
    sampler:
      name: DistributedBatchSampler
      batch_size: 64
      drop_last: False
      shuffle: False
    loader:
      num_workers: 4
      use_shared_memory: True
Infer:
  infer_imgs: docs/images/inference_deployment/whl_demo.jpg
  batch_size: 10
  transforms:
    to_rgb: True
    channel_first: False
    resize_short: 256
    size: 224
    scale: 1.0/255.0
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
    order: ''
    channel_num: *image_channel
  PostProcess:
    name: Topk
    topk: 5
    class_id_map_file: ppcls/utils/imagenet1k_label_list.txt
Metric:
  Train:
    topk: [1, 5]
  Eval:
    topk: [1, 5]
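As a rough illustration of what the scale, mean, and std entries under transform_ops in this config imply, the sketch below applies the usual per-channel normalization, (pixel * scale - mean) / std, to pixels stored channels-last (HWC, matching channel_first: False). The function names are hypothetical and this is plain Python for clarity, not the PaddleClas implementation.

```python
# Values copied from the transform_ops entries of the config above.
SCALE = 1.0 / 255.0
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Normalize one RGB pixel given as three values in the 0-255 range."""
    return [(value * SCALE - m) / s for value, m, s in zip(rgb, MEAN, STD)]


def normalize_image_hwc(image):
    """Normalize an image stored as rows of RGB pixels (channels last)."""
    return [[normalize_pixel(pixel) for pixel in row] for row in image]


if __name__ == "__main__":
    # A 1x2 image: one black pixel and one white pixel.
    print(normalize_image_hwc([[[0, 0, 0], [255, 255, 255]]]))
```

With these constants the normalized values should land roughly between -2.2 and 2.7, so inputs far outside that range on a given device could be one sign that preprocessing is not behaving as configured.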
qybjjes14#
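I ran into the same problem back then: training with the models from paddle.vision.models would not converge, while a four- or five-layer network I threw together myself converged fine... Did you find the cause?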
4szc88ey5#
How did you solve the problem in the end?
wlp8pajw6#
I later trained with the PaddleClas suite instead, and training with the suite worked fine; the suite uses a fairly old Paddle version. I am not sure what exactly was going on.
r1wp621o7#
Could you try the release/2.5 branch of the PaddleClas suite together with the latest Paddle release, and see whether the problem still occurs?
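When comparing runs across branches or Paddle versions, one quick check is whether the logged loss sits at chance level. For a 1000-class cross-entropy loss, a model that predicts a uniform distribution scores ln(1000), roughly 6.91. The sketch below is a minimal illustration of that check; the helper names and the tolerance are assumptions made here, not part of PaddleClas.

```python
import math

NUM_CLASSES = 1000  # class_num from the config in this thread


def chance_level_ce_loss(num_classes=NUM_CLASSES):
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return math.log(num_classes)


def looks_stuck_at_chance(losses, num_classes=NUM_CLASSES, tolerance=0.05):
    """True if every given loss value is within tolerance of chance level.

    `tolerance` is an arbitrary threshold chosen for illustration.
    """
    baseline = chance_level_ce_loss(num_classes)
    return bool(losses) and all(abs(x - baseline) <= tolerance for x in losses)


if __name__ == "__main__":
    print(round(chance_level_ce_loss(), 4))
```

A loss pinned near this value after many iterations suggests the network is not learning at all, whereas a loss that diverges or turns into NaN points toward a numerical problem instead.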