Paddle ResNet50 training loss does not decrease

Posted by qyswt5oh 4 months ago in Other

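Describe the Bug

I am training ResNet50 on ImageNet2012 with PaddleClas, and the loss does not decrease.

To validate the kernels, I also ran ResNet18:

ResNet50 and ResNet18 use the same configuration, so I would like to ask roughly which part has gone wrong. Looking forward to your reply, thanks!
The configuration is as follows:
Both networks use a batch size of 256, and both run on a single card.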
Arch:
  name: ResNet50
  class_num: 1000
  input_image_channel: *image_channel
  data_format: "NHWC"

Loss:
  Train:
    - CELoss:
        weight: 1.0
  Eval:
    - CELoss:
        weight: 1.0

Optimizer:
  name: SGD
  lr:
    name: Piecewise
    learning_rate: 0.1
    decay_epochs: [30, 60, 90]
    values: [0.1, 0.01, 0.001, 0.0001]
  regularizer:
    name: 'L2'
    coeff: 0.0001
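
With class_num: 1000 and a cross-entropy loss, one reference point for judging "loss does not decrease" is the loss of a uniform prediction, ln(1000), roughly 6.91. The sketch below is only an illustration: the helper names `chance_level_ce` and `stuck_at_chance` and the tolerance are hypothetical, not part of PaddleClas. If the logged loss hovers around this value, the network is probably not learning at all, as opposed to learning slowly.

```python
import math


def chance_level_ce(num_classes: int) -> float:
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return math.log(num_classes)


def stuck_at_chance(losses, num_classes, tol=0.05):
    """Return True if the mean of the given losses is within tol of ln(num_classes).

    `losses` is a plain list of floats copied from a training log;
    `tol` is an arbitrary illustrative tolerance.
    """
    if not losses:
        raise ValueError("losses must not be empty")
    mean_loss = sum(losses) / len(losses)
    return abs(mean_loss - chance_level_ce(num_classes)) <= tol


if __name__ == "__main__":
    # ln(1000) for the 1000-class ImageNet setup in the config above
    print(round(chance_level_ce(1000), 4))  # 6.9078
```

A ResNet18 run whose loss drops well below this value while ResNet50 stays at it would suggest the difference is specific to ResNet50 (for example, the bottleneck blocks or a kernel used only there) and not to the data pipeline.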

Additional Supplementary Information

No response

o2gm4chl #1

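Please provide how you run the training, along with detailed information such as the Paddle version and the CUDA and cuDNN versions. It would also help if you could upload the training log.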
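
One possible way to gather the requested version details in one go is sketched below. The function name `collect_env` is hypothetical. The sketch assumes a standard Paddle build that exposes `paddle.version.cuda()`, `paddle.version.cudnn()` and `paddle.device.get_device()`; on a non-CUDA build such as an Intel GPU plugin build, the CUDA and cuDNN fields may simply report that they are unavailable.

```python
import platform
import sys


def collect_env():
    """Collect basic environment details for a bug report as a dict."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import paddle  # imported lazily so the script still runs without Paddle
    except ImportError:
        info["paddle"] = "not installed"
        return info

    info["paddle"] = paddle.__version__
    # Commit hash of the build, useful when running from a source checkout
    info["paddle_commit"] = getattr(paddle.version, "commit", "unknown")
    info["cuda"] = paddle.version.cuda()
    info["cudnn"] = paddle.version.cudnn()
    info["device"] = paddle.device.get_device()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```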

olqngx59 #2

How I run it: python ./tools/train.py -c /home/compat/chaofanl/PaddleClas/ResNet50-new.yaml
paddlepaddle: f55b387df0f473574f82c83da0c4c821829f35a7 (Date: Tue Apr 25 21:02:37 2023)
paddleclas: 8ed2060033bfccef58c6e07a96ab4181f72fa7c5 (Date: Fri Jul 7 07:45:52 2023)
I am using an Intel GPU.

train.log

bcs8qyzn #3

The YAML file is as follows:
Global:
  checkpoints: null
  pretrained_model: null
  output_dir: ./output-new/
  device: intel_gpu
  save_interval: 1
  eval_during_train: False
  eval_interval: 1
  epochs: 120
  print_batch_step: 1
  use_visualdl: True
  image_channel: &image_channel 3
  image_shape: [*image_channel, 224, 224]
  save_inference_dir: ./inference
  to_static: False

Arch:
  name: ResNet50
  class_num: 1000
  input_image_channel: *image_channel
  data_format: "NHWC"

Loss:
  Train:
    - CELoss:
        weight: 1.0
  Eval:
    - CELoss:
        weight: 1.0

Optimizer:
  name: SGD
  lr:
    name: Piecewise
    learning_rate: 0.1
    decay_epochs: [30, 60, 90]
    values: [0.1, 0.01, 0.001, 0.0001]
  regularizer:
    name: 'L2'
    coeff: 0.0001
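
To make the Piecewise schedule in the Optimizer section concrete, here is a minimal sketch of how the learning rate would step over the 120 epochs, at epoch granularity. The function `piecewise_lr` is hypothetical and only mirrors the semantics of a piecewise-constant schedule; PaddleClas may convert the epoch boundaries to iteration steps internally, which this sketch does not model.

```python
from bisect import bisect_right


def piecewise_lr(epoch, decay_epochs, values):
    """Piecewise-constant learning rate.

    values[i] applies while epoch < decay_epochs[i]; the last value applies
    from the final boundary onward.
    """
    if len(values) != len(decay_epochs) + 1:
        raise ValueError("values must have exactly one more entry than decay_epochs")
    return values[bisect_right(decay_epochs, epoch)]


if __name__ == "__main__":
    decay_epochs = [30, 60, 90]
    values = [0.1, 0.01, 0.001, 0.0001]
    for epoch in (0, 29, 30, 60, 90, 119):
        print(epoch, piecewise_lr(epoch, decay_epochs, values))
```

Under this reading, the first 30 epochs all run at 0.1, so a loss that is flat from the very first iterations is unlikely to be caused by the decay boundaries themselves.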

DataLoader:
  Train:
    dataset:
      name: ImageNetDataset
      image_root: /home_notmod/dataset_broad/dataset/imagenet/img_raw
      cls_label_path: /home_notmod/dataset_broad/dataset/imagenet/img_raw/train_list.txt
      transform_ops:
        - DecodeImage:
            to_rgb: True
            channel_first: False
        - RandCropImage:
            size: 224
        - RandFlipImage:
            flip_code: 1
        - NormalizeImage:
            scale: 1.0/255.0
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
            channel_num: *image_channel
    sampler:
      name: DistributedBatchSampler
      batch_size: 256
      drop_last: False
      shuffle: True
    loader:
      num_workers: 4
      use_shared_memory: True

  Eval:
    dataset:
      name: ImageNetDataset
      image_root: /home_notmod/dataset_broad/dataset/imagenet/img_raw
      cls_label_path: /home_notmod/dataset_broad/dataset/imagenet/img_raw/val_list.txt
      transform_ops:
        - DecodeImage:
            to_rgb: True
            channel_first: False
        - ResizeImage:
            resize_short: 256
        - CropImage:
            size: 224
        - NormalizeImage:
            scale: 1.0/255.0
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
            channel_num: *image_channel
    sampler:
      name: DistributedBatchSampler
      batch_size: 64
      drop_last: False
      shuffle: False
    loader:
      num_workers: 4
      use_shared_memory: True
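
As a rough illustration of what the NormalizeImage settings above compute per pixel, the sketch below applies the scale, mean and std from the config to a single RGB value. The function `normalize_pixel` is hypothetical and is not the PaddleClas implementation; it only assumes the usual formula (pixel * scale - mean) / std applied channel by channel.

```python
# Values copied from the NormalizeImage entries in the config
SCALE = 1.0 / 255.0
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Normalize one (R, G, B) pixel with 0-255 values, channel by channel."""
    return tuple((c * SCALE - m) / s for c, m, s in zip(rgb, MEAN, STD))


if __name__ == "__main__":
    print(normalize_pixel((0, 0, 0)))        # darkest input
    print(normalize_pixel((255, 255, 255)))  # brightest input
```

Checking that the inputs reaching the network fall roughly in this range (about -2.1 to 2.6) is one quick way to rule out a preprocessing problem before suspecting the model or the device kernels.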

Infer:
  infer_imgs: docs/images/inference_deployment/whl_demo.jpg
  batch_size: 10
  transforms:
    - DecodeImage:
        to_rgb: True
        channel_first: False
    - ResizeImage:
        resize_short: 256
    - CropImage:
        size: 224
    - NormalizeImage:
        scale: 1.0/255.0
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]
        order: ''
        channel_num: *image_channel
    - ToCHWImage:
  PostProcess:
    name: Topk
    topk: 5
    class_id_map_file: ppcls/utils/imagenet1k_label_list.txt

Metric:
  Train:
    - TopkAcc:
        topk: [1, 5]
  Eval:
    - TopkAcc:
        topk: [1, 5]

qybjjes1 #4

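I ran into the same problem back then: training a model from paddle.vision.models would not converge, while a four- or five-layer network I put together myself converged fine. Did you find the cause?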

4szc88ey #5

How did you solve the problem in the end?

wlp8pajw #6

I ended up training with the PaddleClas suite instead, and training with the suite worked fine. The suite uses a relatively old version of Paddle. I don't know exactly what was going on either.

r1wp621o #7

Could you try the release/2.5 branch of the PaddleClas suite together with the latest release version of Paddle, and see whether the problem still occurs?
