Paddle: lr is already set very small, but loss: NaN always appears at iter 20

nzrxty8p  posted on 2021-12-07  in  Java
  • Version and environment information:

Paddle version: 2.0.0-rc0
Paddle With CUDA: True
OS: Windows 10
Python version: 3.7.0
CUDA version: 10.2.89
cuDNN version: 7.6.5
Nvidia driver version: 457.09

  • Training information

   2) GPU memory: NVIDIA GeForce RTX3070, 8.0 GB
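
One detail in the environment above may be worth checking, although nothing in this thread confirms it as the cause. The training log further down reports "CUDA Capability: 86" together with "Runtime API Version: 10.2". Compute capability 8.6 (Ampere, which includes the RTX 3070) is natively supported by CUDA toolkits from 11.1 onward, so a build against CUDA 10.2 cannot contain native code for that GPU. The sketch below compares the two numbers; the lookup table and function name are illustrative and not part of Paddle.

```python
# Minimum CUDA toolkit version with native support for a compute capability.
# The table is deliberately partial and only covers the GPUs discussed here.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 1): (8, 0),   # Pascal, e.g. GTX 1060
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere, A100
    (8, 6): (11, 1),  # Ampere, e.g. RTX 3070
}


def runtime_supports_gpu(capability, runtime):
    """Return True if the CUDA runtime version natively supports the capability."""
    return runtime >= MIN_CUDA_FOR_CAPABILITY[capability]


print(runtime_supports_gpu((8, 6), (10, 2)))  # RTX 3070 with CUDA 10.2
print(runtime_supports_gpu((6, 1), (10, 2)))  # GTX 1060 with CUDA 10.2
```

The first call reports a mismatch and the second does not, which is consistent with (but does not prove) the difference between the two machines in this thread.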

The network backbone is ResNet50-vd-FPN-DCNv2 and the network type is Cascade Faster R-CNN. The config file is as follows:
architecture: CascadeRCNN
max_iters: 30000
snapshot_iter: 3000
use_gpu: true
log_smooth_window: 20
log_iter: 20
save_dir: output
weights: output/cascade_rcnn_dcn_r50_vd_fpn_gen_server_side_traffic4/model_final
metric: VOC
num_classes: 5

backbone: ResNet
fpn: FPN
rpn_head: FPNRPNHead
roi_extractor: FPNRoIAlign
bbox_head: CascadeBBoxHead
bbox_assigner: CascadeBBoxAssigner

norm_type: bn
depth: 50
feature_maps: [2, 3, 4, 5]
freeze_at: 2
variant: d
dcn_v2_stages: [3, 4, 5]
lr_mult_list: [0.05, 0.05, 0.1, 0.15]

max_level: 6
min_level: 2
num_chan: 64
spatial_scale: [0.03125, 0.0625, 0.125, 0.25]

anchor_sizes: [32, 64, 128, 256, 512]
aspect_ratios: [0.5, 1.0, 2.0]
stride: [16.0, 16.0]
variance: [1.0, 1.0, 1.0, 1.0]
anchor_start_size: 32
min_level: 2
max_level: 6
num_chan: 64
rpn_batch_size_per_im: 256
rpn_fg_fraction: 0.5
rpn_positive_overlap: 0.7
rpn_negative_overlap: 0.3
rpn_straddle_thresh: 0.0
min_size: 0.0
nms_thresh: 0.7
pre_nms_top_n: 2000
post_nms_top_n: 2000
min_size: 0.0
nms_thresh: 0.7
pre_nms_top_n: 500
post_nms_top_n: 300

canconical_level: 4
canonical_size: 224
min_level: 2
max_level: 5
box_resolution: 7
sampling_ratio: 2

batch_size_per_im: 512
bbox_reg_weights: [10, 20, 30]
bg_thresh_lo: [0.0, 0.0, 0.0]
bg_thresh_hi: [0.5, 0.6, 0.7]
fg_thresh: [0.5, 0.6, 0.7]
fg_fraction: 0.25

head: CascadeTwoFCHead
bbox_loss: BalancedL1Loss
keep_top_k: 100
nms_threshold: 0.5
score_threshold: 0.05

alpha: 0.5
gamma: 1.5
beta: 1.0
loss_weight: 1.0

mlp_dim: 1024

base_lr: 0.0000125

  • !PiecewiseDecay

gamma: 0.1
milestones: [24000, 26000]

  • !LinearWarmup

start_factor: 0.1
steps: 1000

momentum: 0.9
type: Momentum
factor: 0.0001
type: L2

fields: ['image', 'im_info', 'im_id', 'gt_bbox', 'gt_class', 'is_crowd']
anno_path: train.txt
dataset_dir: dataset/traffic_light4
use_default_label: false

  • !DecodeImage

to_rgb: true

  • !RandomFlipImage

prob: 0.5

  • !AutoAugmentImage

autoaug_type: v1

  • !NormalizeImage

is_channel_first: false
is_scale: true
mean: [0.485,0.456,0.406]
std: [0.229, 0.224,0.225]

  • !ResizeImage

target_size: [640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024]
max_size: 1500
interp: 1
use_cv2: true

  • !Permute

to_bgr: false
channel_first: true

  • !PadBatch

pad_to_stride: 32
use_padded_im_info: false
batch_size: 2
shuffle: true
worker_num: 2
use_process: false

#fields: ['image', 'im_info', 'im_id', 'im_shape']

# for voc

fields: ['image', 'im_info', 'im_id','im_shape', 'gt_bbox', 'gt_class', 'is_difficult']
anno_path: val.txt
dataset_dir: dataset/traffic_light4
use_default_label: false

  • !DecodeImage

to_rgb: true
with_mixup: false

  • !NormalizeImage

is_channel_first: false
is_scale: true
mean: [0.485,0.456,0.406]
std: [0.229, 0.224,0.225]

  • !ResizeImage

interp: 1
max_size: 1500
target_size: 1000
use_cv2: true

  • !Permute

channel_first: true
to_bgr: false

  • !PadBatch

pad_to_stride: 32
use_padded_im_info: true
batch_size: 1
shuffle: false
drop_empty: false
worker_num: 2


# set image_shape if needed

fields: ['image', 'im_info', 'im_id', 'im_shape']
use_default_label: false
with_background: true
anno_path: dataset/traffic_light4/label_list.txt

  • !DecodeImage

to_rgb: true
with_mixup: false

  • !NormalizeImage

is_channel_first: false
is_scale: true
mean: [0.485,0.456,0.406]
std: [0.229, 0.224,0.225]

  • !ResizeImage

interp: 1
max_size: 1500
target_size: 1000
use_cv2: true

  • !Permute

channel_first: true
to_bgr: false

  • !PadBatch

pad_to_stride: 32
use_padded_im_info: true
batch_size: 1
shuffle: false
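
The learning-rate part of the config above (base_lr, !PiecewiseDecay, !LinearWarmup) can be sketched as a plain function. This is a minimal sketch, assuming the warmup factor rises linearly from start_factor to 1 over the warmup steps and that each milestone already reached multiplies the rate by gamma. The function name lr_at is hypothetical and this is not PaddleDetection's own code.

```python
def lr_at(it, base_lr=0.0000125, start_factor=0.1, warmup_steps=1000,
          gamma=0.1, milestones=(24000, 26000)):
    """Learning rate at iteration `it` under linear warmup plus piecewise decay."""
    # Piecewise decay: one factor of gamma per milestone already reached.
    lr = base_lr * gamma ** sum(it >= m for m in milestones)
    if it < warmup_steps:
        # Linear warmup: the factor goes from start_factor at it=0 to 1.0.
        alpha = it / warmup_steps
        lr *= start_factor * (1 - alpha) + alpha
    return lr


for it in (0, 20, 40, 1000, 24000):
    print(it, f"{lr_at(it):.6f}")
```

With the defaults taken from the config, iteration 0 gives 1.25e-6, which prints as 0.000001 at six decimals, the same value the training log shows. The rate is therefore already tiny when the NaN first appears.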

C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\layers\ UserWarning: D:\lyx\PaddleDetection-release-0.5\ppdet\modeling\backbones\
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\layers\ UserWarning: D:\lyx\PaddleDetection-release-0.5\ppdet\modeling\losses\
The behavior of expression A - B has been unified with elementwise_sub(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_sub(X, Y, axis=0) instead of A - B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\layers\ UserWarning: D:\lyx\PaddleDetection-release-0.5\ppdet\modeling\losses\
The behavior of expression A * B has been unified with elementwise_mul(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_mul(X, Y, axis=0) instead of A * B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\layers\ UserWarning: D:\lyx\PaddleDetection-release-0.5\ppdet\modeling\losses\
The behavior of expression A - B has been unified with elementwise_sub(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_sub(X, Y, axis=0) instead of A - B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\layers\ UserWarning: D:\lyx\PaddleDetection-release-0.5\ppdet\modeling\losses\
The behavior of expression A * B has been unified with elementwise_mul(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_mul(X, Y, axis=0) instead of A * B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\layers\ UserWarning: D:\lyx\PaddleDetection-release-0.5\ppdet\modeling\losses\
The behavior of expression A * B has been unified with elementwise_mul(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_mul(X, Y, axis=0) instead of A * B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\layers\ UserWarning: D:\lyx\PaddleDetection-release-0.5\ppdet\modeling\losses\
The behavior of expression A * B has been unified with elementwise_mul(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_mul(X, Y, axis=0) instead of A * B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\layers\ UserWarning: D:\lyx\PaddleDetection-release-0.5\ppdet\modeling\losses\
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\layers\ UserWarning: D:\lyx\PaddleDetection-release-0.5\ppdet\modeling\losses\
The behavior of expression A * B has been unified with elementwise_mul(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_mul(X, Y, axis=0) instead of A * B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
2020-12-22 18:38:04,668-INFO: If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. The Regularization[L2Decay, regularization_coeff=0.000100] in Optimizer will not take effect, and it will only be applied to other Parameters!
C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\layers\ UserWarning: D:\lyx\PaddleDetection-release-0.5\ppdet\modeling\roi_heads\
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\layers\ UserWarning: D:\lyx\PaddleDetection-release-0.5\ppdet\modeling\roi_heads\
The behavior of expression A / B has been unified with elementwise_div(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_div(X, Y, axis=0) instead of A / B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
W1222 18:38:23.042533 10816] Please NOTE: device: 0, CUDA Capability: 86, Driver API Version: 11.1, Runtime API Version: 10.2
W1222 18:38:23.126313 10816] device: 0, cuDNN Version: 7.6.
C:\ProgramData\Anaconda3\envs\pp\lib\site-packages\paddle\fluid\ UserWarning: This list is not set, Because of Paramerter not found in program. There are: fc_0.b_0 fc_0.w_0
format(" ".join(unused_para_list)))
W1222 18:59:18.041463 10816] fusion_group is not enabled for Windows/MacOS now, and only effective when running with CUDA GPU.
D:\lyx\PaddleDetection-release-0.5\ppdet\data\ DeprecationWarning: Using or importing the ABCs from 'collections' instead of from '' is deprecated, and in 3.8 it will stop working
if isinstance(item, collections.Sequence) and len(item) == 0:
D:\lyx\PaddleDetection-release-0.5\ppdet\data\transform\ DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
if 'replace' in inspect.getargspec(func)[0]:
D:\lyx\PaddleDetection-release-0.5\ppdet\data\transform\ DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
assert 'replace' == inspect.getargspec(func)[0][-1]
D:\lyx\PaddleDetection-release-0.5\ppdet\data\transform\ DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
if 'bboxes' not in inspect.getargspec(func)[0]:
D:\lyx\PaddleDetection-release-0.5\ppdet\data\transform\ DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
if 'prob' in inspect.getargspec(func)[0]:
D:\lyx\PaddleDetection-release-0.5\ppdet\data\transform\ DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
assert 'bboxes' == inspect.getargspec(func)[0][1]
D:\lyx\PaddleDetection-release-0.5\ppdet\data\transform\ DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
if 'prob' in inspect.getargspec(func)[0]:
tools/ RuntimeWarning: divide by zero encountered in double_scalars
ips = float(cfg['TrainReader']['batch_size']) / time_cost
2020-12-22 18:59:20,002-INFO: iter: 0, lr: 0.000001, 'loss_cls_0': '7.239295', 'loss_loc_0': '0.004803', 'loss_cls_1': '2.924091', 'loss_loc_1': '0.001301', 'loss_cls_2': '2.735808', 'loss_loc_2': '0.001601', 'loss_rpn_cls': '30.671740', 'loss_rpn_bbox': '0.021397', 'loss': '43.600033', eta: 0:00:00, batch_cost: 0.00000 sec, ips: inf images/sec
2020-12-22 18:59:32,541-INFO: iter: 20, lr: 0.000001, 'loss_cls_0': '1.609347', 'loss_loc_0': '0.000000', 'loss_cls_1': '0.804670', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.402333', 'loss_loc_2': '0.000000', 'loss_rpn_cls': 'nan', 'loss_rpn_bbox': 'nan', 'loss': 'nan', eta: 5:44:36, batch_cost: 0.68969 sec, ips: 2.89985 images/sec
2020-12-22 18:59:45,374-INFO: iter: 40, lr: 0.000002, 'loss_cls_0': '1.608831', 'loss_loc_0': '0.000000', 'loss_cls_1': '0.804448', 'loss_loc_1': '-0.000000', 'loss_cls_2': '0.402221', 'loss_loc_2': '0.000000', 'loss_rpn_cls': 'nan', 'loss_rpn_bbox': 'nan', 'loss': 'nan', eta: 5:23:35, batch_cost: 0.64804 sec, ips: 3.08623 images/sec
2020-12-22 18:59:57,319-INFO: iter: 60, lr: 0.000002, 'loss_cls_0': '1.608098', 'loss_loc_0': '0.000000', 'loss_cls_1': '0.804157', 'loss_loc_1': '-0.000000', 'loss_cls_2': '0.402075', 'loss_loc_2': '0.000000', 'loss_rpn_cls': 'nan', 'loss_rpn_bbox': 'nan', 'loss': 'nan', eta: 4:56:07, batch_cost: 0.59344 sec, ips: 3.37016 images/sec
2020-12-22 19:00:08,984-INFO: iter: 80, lr: 0.000002, 'loss_cls_0': '1.607245', 'loss_loc_0': '0.000000', 'loss_cls_1': '0.803822', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.401908', 'loss_loc_2': '0.000000', 'loss_rpn_cls': 'nan', 'loss_rpn_bbox': 'nan', 'loss': 'nan', eta: 4:56:26, batch_cost: 0.59448 sec, ips: 3.36430 images/sec
2020-12-22 19:00:20,621-INFO: iter: 100, lr: 0.000002, 'loss_cls_0': '1.606289', 'loss_loc_0': '0.000000', 'loss_cls_1': '0.803448', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.401720', 'loss_loc_2': '0.000000', 'loss_rpn_cls': 'nan', 'loss_rpn_bbox': 'nan', 'loss': 'nan', eta: 4:46:22, batch_cost: 0.57468 sec, ips: 3.48020 images/sec
2020-12-22 19:00:33,152-INFO: iter: 120, lr: 0.000003, 'loss_cls_0': '1.605234', 'loss_loc_0': '0.000000', 'loss_cls_1': '0.803034', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.401513', 'loss_loc_2': '0.000000', 'loss_rpn_cls': 'nan', 'loss_rpn_bbox': 'nan', 'loss': 'nan', eta: 5:11:24, batch_cost: 0.62533 sec, ips: 3.19833 images/sec
2020-12-22 19:00:46,113-INFO: iter: 140, lr: 0.000003, 'loss_cls_0': '1.604078', 'loss_loc_0': '0.000000', 'loss_cls_1': '0.802581', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.401287', 'loss_loc_2': '0.000000', 'loss_rpn_cls': 'nan', 'loss_rpn_bbox': 'nan', 'loss': 'nan', eta: 5:24:37, batch_cost: 0.65229 sec, ips: 3.06613 images/sec
2020-12-22 19:00:59,035-INFO: iter: 160, lr: 0.000003, 'loss_cls_0': '1.602824', 'loss_loc_0': '0.000000', 'loss_cls_1': '0.802090', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.401041', 'loss_loc_2': '0.000000', 'loss_rpn_cls': 'nan', 'loss_rpn_bbox': 'nan', 'loss': 'nan', eta: 5:19:38, batch_cost: 0.64272 sec, ips: 3.11175 images/sec

Hi! We've received your issue and please be patient to get responded. We will arrange technicians to answer your questions as soon as possible. Please make sure that you have posted enough message to demo your request. You may also check out the API, FAQ, GitHub Issue and AI community to get the answer. Have a nice day!

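Thanks for the reply. I re-checked my dataset carefully and found no errors in the label mapping. I also trained on the same dataset on another single-machine, single-GPU Windows 10 box with a GTX 1060 6.0 GB. The config file is identical to the previous one except that lr was changed to 0.00125. The training output is as follows: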
2020-12-23 10:21:43,046-INFO: If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. The Regularization[L2Decay, regularization_coeff=0.000100] in Optimizer will not take effect, and it will only be applied to other Parameters!
2020-12-23 10:21:49,138-INFO: places would be ommited when DataLoader is not iterable
2020-12-23 10:22:10,298-WARNING: C:\Users\Fundway/.cache/paddle/weights\ResNet50_vd_ssld_v2_pretrained.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
C:\Users\Fundway\AppData\Local\conda\conda\envs\testpp\lib\site-packages\paddle\fluid\ UserWarning: This list is not set, Because of Paramerter not found in program. There are: fc_0.b_0 fc_0.w_0
format(" ".join(unused_para_list)))
2020-12-23 10:22:53,611-INFO: places would be ommited when DataLoader is not iterable
2020-12-23 10:22:58,113-INFO: iter: 0, lr: 0.000125, 'loss_cls_0': '1.596302', 'loss_loc_0': '0.000002', 'loss_cls_1': '0.768629', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.394244', 'loss_loc_2': '0.000000', 'loss_rpn_cls': '0.695158', 'loss_rpn_bbox': '0.008708', 'loss': '3.463043', time: 0.000, eta: 0:00:00
2020-12-23 10:23:24,056-INFO: iter: 20, lr: 0.000148, 'loss_cls_0': '1.253679', 'loss_loc_0': '0.000003', 'loss_cls_1': '0.597673', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.311132', 'loss_loc_2': '0.000000', 'loss_rpn_cls': '0.694021', 'loss_rpn_bbox': '0.010990', 'loss': '2.863490', time: 1.449, eta: 8 days, 9:18:20
2020-12-23 10:23:49,095-INFO: iter: 40, lr: 0.000170, 'loss_cls_0': '0.021264', 'loss_loc_0': '0.000019', 'loss_cls_1': '0.001989', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.001103', 'loss_loc_2': '0.000000', 'loss_rpn_cls': '0.691823', 'loss_rpn_bbox': '0.010662', 'loss': '0.736680', time: 1.262, eta: 7 days, 7:24:05
2020-12-23 10:24:15,180-INFO: iter: 60, lr: 0.000193, 'loss_cls_0': '0.018982', 'loss_loc_0': '0.000025', 'loss_cls_1': '0.000775', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.000427', 'loss_loc_2': '0.000000', 'loss_rpn_cls': '0.681162', 'loss_rpn_bbox': '0.008746', 'loss': '0.709770', time: 1.282, eta: 7 days, 10:02:55
2020-12-23 10:24:39,396-INFO: iter: 80, lr: 0.000215, 'loss_cls_0': '0.058617', 'loss_loc_0': '0.000007', 'loss_cls_1': '0.017247', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.005145', 'loss_loc_2': '0.000000', 'loss_rpn_cls': '0.661259', 'loss_rpn_bbox': '0.009319', 'loss': '0.757371', time: 1.264, eta: 7 days, 7:38:40
2020-12-23 10:25:07,956-INFO: iter: 100, lr: 0.000238, 'loss_cls_0': '0.044691', 'loss_loc_0': '0.000004', 'loss_cls_1': '0.007914', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.003082', 'loss_loc_2': '0.000000', 'loss_rpn_cls': '0.625538', 'loss_rpn_bbox': '0.009820', 'loss': '0.694501', time: 1.413, eta: 8 days, 4:20:26
2020-12-23 10:25:28,858-INFO: iter: 120, lr: 0.000260, 'loss_cls_0': '0.036790', 'loss_loc_0': '0.000003', 'loss_cls_1': '0.002068', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.000903', 'loss_loc_2': '0.000000', 'loss_rpn_cls': '0.510809', 'loss_rpn_bbox': '0.009324', 'loss': '0.575176', time: 1.053, eta: 6 days, 2:19:39
2020-12-23 10:25:48,041-INFO: iter: 140, lr: 0.000282, 'loss_cls_0': '0.045195', 'loss_loc_0': '0.000003', 'loss_cls_1': '0.000785', 'loss_loc_1': '0.000000', 'loss_cls_2': '0.000391', 'loss_loc_2': '0.000000', 'loss_rpn_cls': '0.289394', 'loss_rpn_bbox': '0.013347', 'loss': '0.351146', time: 0.938, eta: 5 days, 10:19:47
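
The follow-up mentions re-checking the label mapping. A related check that is sometimes useful when box losses go NaN is whether any annotated box is degenerate (zero or negative width or height) or lies outside the image. Nothing in this thread indicates that the dataset has such boxes, and the successful run above suggests it does not. The sketch assumes standard Pascal VOC XML fields; the sample annotation and its class names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def bad_boxes(xml_text):
    """Return (name, box) pairs whose box is empty or falls outside the image."""
    root = ET.fromstring(xml_text)
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    problems = []
    for obj in root.iter("object"):
        box = tuple(int(float(obj.findtext(f"bndbox/{tag}")))
                    for tag in ("xmin", "ymin", "xmax", "ymax"))
        xmin, ymin, xmax, ymax = box
        empty = xmin >= xmax or ymin >= ymax
        outside = xmin < 0 or ymin < 0 or xmax > width or ymax > height
        if empty or outside:
            problems.append((obj.findtext("name"), box))
    return problems


# Hypothetical annotation: the second object has zero width.
SAMPLE = """<annotation>
  <size><width>1920</width><height>1080</height></size>
  <object><name>red</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>120</xmax><ymax>90</ymax></bndbox>
  </object>
  <object><name>green</name>
    <bndbox><xmin>300</xmin><ymin>200</ymin><xmax>300</xmax><ymax>240</ymax></bndbox>
  </object>
</annotation>"""
print(bad_boxes(SAMPLE))
```

Running this over every annotation file listed in train.txt would flag boxes of this kind, if there are any.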
