learning_rate = 0.001
One CPU device and one GPU device are used.
When training on the CPU, the results and loss for the first 20 epochs are as follows:
Batch 0, loss 2.522635, acc 0.088436
Done epoch: 0
Epoch: 1
Batch 0, loss 2.114787, acc 0.211000
Done epoch: 1
Epoch: 2
Batch 0, loss 2.454542, acc 0.079980
Done epoch: 2
Epoch: 3
Batch 0, loss 1.930440, acc 0.284088
Done epoch: 3
Epoch: 4
Batch 0, loss 1.789533, acc 0.447309
Done epoch: 4
Epoch: 5
Batch 0, loss 1.745031, acc 0.273679
Done epoch: 5
Epoch: 6
Batch 0, loss 1.684448, acc 0.412491
Done epoch: 6
Epoch: 7
Batch 0, loss 1.654690, acc 0.362373
Done epoch: 7
Epoch: 8
Batch 0, loss 1.588757, acc 0.495754
Done epoch: 8
Epoch: 9
Batch 0, loss 1.572071, acc 0.514561
Done epoch: 9
Epoch: 10
Batch 0, loss 1.540798, acc 0.545184
Done epoch: 10
Epoch: 11
Batch 0, loss 1.591666, acc 0.621386
Done epoch: 11
Epoch: 12
Batch 0, loss 1.504963, acc 0.645011
Done epoch: 12
Epoch: 13
Batch 0, loss 1.534026, acc 0.642602
Done epoch: 13
Epoch: 14
Batch 0, loss 1.476046, acc 0.655928
Done epoch: 14
Epoch: 15
Batch 0, loss 1.491074, acc 0.670692
Done epoch: 15
Epoch: 16
Batch 0, loss 1.444524, acc 0.703621
Done epoch: 16
Epoch: 17
Batch 0, loss 1.514576, acc 0.728161
Done epoch: 17
Epoch: 18
Batch 0, loss 1.492055, acc 0.758072
Done epoch: 18
Epoch: 19
Batch 0, loss 1.408592, acc 0.801796
Done epoch: 19
Epoch: 20
Batch 0, loss 1.415996, acc 0.744361
Done epoch: 20
################################################
When training on the GPU, the results for the first 20 epochs are as follows:
Epoch: 0
Batch 0, loss 2.440385, acc 0.112815
Done epoch: 0
Epoch: 1
Batch 0, loss 2.238379, acc 0.001942
Done epoch: 1
Epoch: 2
Batch 0, loss 2.268656, acc 0.002467
Done epoch: 2
Epoch: 3
Batch 0, loss 2.236463, acc 0.000704
Done epoch: 3
Epoch: 4
Batch 0, loss 2.211993, acc 0.000000
Done epoch: 4
Epoch: 5
Batch 0, loss 2.192462, acc 0.001938
Done epoch: 5
Epoch: 6
Batch 0, loss 2.183870, acc 0.001938
Done epoch: 6
Epoch: 7
Batch 0, loss 2.173697, acc 0.000000
Done epoch: 7
Epoch: 8
Batch 0, loss 2.173233, acc 0.000000
Done epoch: 8
Epoch: 9
Batch 0, loss 2.154153, acc 0.000008
Done epoch: 9
Epoch: 10
Batch 0, loss 2.143353, acc 0.000008
Done epoch: 10
Epoch: 11
Batch 0, loss 2.136832, acc 0.390633
Done epoch: 11
Epoch: 12
Batch 0, loss 2.122867, acc 0.446275
Done epoch: 12
Epoch: 13
Batch 0, loss 2.113915, acc 0.441837
Done epoch: 13
Epoch: 14
Batch 0, loss 2.116818, acc 0.341406
Done epoch: 14
Epoch: 15
Batch 0, loss 2.091358, acc 0.479894
Done epoch: 15
Epoch: 16
Batch 0, loss 2.094740, acc 0.422867
Done epoch: 16
Epoch: 17
Batch 0, loss 2.086677, acc 0.438015
Done epoch: 17
Epoch: 18
Batch 0, loss 2.080353, acc 0.438019
Done epoch: 18
Epoch: 19
Batch 0, loss 2.083027, acc 0.411423
Done epoch: 19
Epoch: 20
Batch 0, loss 2.068788, acc 0.488256
Done epoch: 20
###############################################
The code configuration is as follows:
if use_gpu:
    places = fluid.cuda_places()
    exe = fluid.Executor(place=fluid.CUDAPlace(0))
else:
    cpu_num = 1
    places = fluid.cpu_places(cpu_num)
    os.environ['CPU_NUM'] = str(cpu_num)
    exe = fluid.Executor(place=fluid.CPUPlace())
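When launching on the GPU, the environment variable CUDA_VISIBLE_DEVICES=0 was set.
I kept watching the GPU run: after more than 400 epochs, the accuracy still stays between 40% and 50%. What could be causing this?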
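To compare the two runs side by side, the log lines can be parsed into numbers. The following is a minimal sketch, assuming the log format shown above ("Batch N, loss X, acc Y"); the names `LINE_RE` and `parse_log` are made up for this illustration, and the two log excerpts are the epoch 19 and 20 entries copied from the logs in this post.

```python
import re

# Matches lines such as: "Batch 0, loss 1.408592, acc 0.801796"
LINE_RE = re.compile(r"Batch\s+(\d+),\s+loss\s+([0-9.]+),\s+acc\s+([0-9.]+)")


def parse_log(text):
    """Return one (loss, acc) tuple per matching 'Batch ...' line, in order."""
    return [(float(m.group(2)), float(m.group(3))) for m in LINE_RE.finditer(text)]


cpu_log = """Epoch: 19
Batch 0, loss 1.408592, acc 0.801796
Done epoch: 19
Epoch: 20
Batch 0, loss 1.415996, acc 0.744361
Done epoch: 20"""

gpu_log = """Epoch: 19
Batch 0, loss 2.083027, acc 0.411423
Done epoch: 19
Epoch: 20
Batch 0, loss 2.068788, acc 0.488256
Done epoch: 20"""

cpu_curve = parse_log(cpu_log)
gpu_curve = parse_log(gpu_log)

# Print the per-epoch gap between the two devices.
for i, ((c_loss, c_acc), (g_loss, g_acc)) in enumerate(zip(cpu_curve, gpu_curve)):
    print("entry %d: loss gap %.6f, acc gap %.6f" % (i, g_loss - c_loss, c_acc - g_acc))
```

Running the same parser over the full logs would show from which epoch the two curves start to separate.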
3 answers
uoifb46i1#
1. Check whether the model contains any stochastic OPs such as dropout.
2. Could you provide the model information for the CPU and GPU?
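Point 1 matters because a stochastic OP makes two runs differ unless its random state is controlled. The following is a minimal pure-Python sketch of that effect, not PaddlePaddle code: `dropout` here is a hypothetical stand-in for an inverted-dropout OP, and the activations are made-up values.

```python
import random


def dropout(values, drop_prob, rng):
    """Inverted dropout: zero each value with probability drop_prob, rescale the rest."""
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() >= drop_prob else 0.0 for v in values]


# Made-up activations, for illustration only.
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Two runs that share a seed draw the same mask and produce identical outputs.
run_a = dropout(activations, 0.5, random.Random(42))
run_b = dropout(activations, 0.5, random.Random(42))

# A run with another seed will generally draw a different mask.
run_c = dropout(activations, 0.5, random.Random(7))

print("same seed, identical output:", run_a == run_b)
print("different seed, identical output:", run_a == run_c)
```

If the model does contain such an OP, the CPU and GPU runs are only comparable once the OP is disabled or its randomness is fixed in both runs.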
ej83mcc02#
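3. Are the model parameters identical at the start of training, or could the random seeds differ?
4. Is the same data selected in each iteration?
If nothing random is involved, I suggest saving the output and parameters of every OP in each iteration, then comparing the CPU and GPU results to see whether some OP differs by a particularly large amount.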
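The checks suggested above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the per-OP outputs have already been dumped from both runs as flat lists of floats in execution order, and the helper names (`batch_fingerprint`, `first_divergent_op`), the OP names, and the numbers are all hypothetical, not PaddlePaddle APIs or real measurements.

```python
import hashlib
import struct


def batch_fingerprint(batch):
    """Short hash of a batch (flat list of floats), to log once per iteration."""
    packed = struct.pack("<%dd" % len(batch), *batch)
    return hashlib.sha256(packed).hexdigest()[:16]


def first_divergent_op(cpu_outputs, gpu_outputs, tol=1e-4):
    """Return (op_name, max_abs_diff) for the first OP differing by more than tol.

    Both arguments are lists of (op_name, flat_list_of_floats) in execution
    order. Returns None when every OP agrees within tol.
    """
    for (cpu_name, cpu_vals), (gpu_name, gpu_vals) in zip(cpu_outputs, gpu_outputs):
        if cpu_name != gpu_name or len(cpu_vals) != len(gpu_vals):
            raise ValueError("dumps are not aligned: %s vs %s" % (cpu_name, gpu_name))
        max_diff = max((abs(c - g) for c, g in zip(cpu_vals, gpu_vals)), default=0.0)
        if max_diff > tol:
            return cpu_name, max_diff
    return None


# Made-up dumps, for illustration only.
cpu_dump = [("fc_0", [0.10, 0.20]), ("softmax_0", [0.40, 0.60]), ("cross_entropy_0", [0.51])]
gpu_dump = [("fc_0", [0.10, 0.20]), ("softmax_0", [0.90, 0.10]), ("cross_entropy_0", [2.30])]

# Equal fingerprints mean both runs were fed the same data in this iteration.
print("same input batch:", batch_fingerprint([1.0, 2.0]) == batch_fingerprint([1.0, 2.0]))

result = first_divergent_op(cpu_dump, gpu_dump)
print("first divergent OP:", result[0] if result else None)
```

Reporting only the first OP that exceeds the tolerance is deliberate: every OP downstream of it will differ too, so the earliest mismatch is the one worth inspecting.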
rur96b6h3#
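1. Check whether the model contains any stochastic OPs such as dropout.
2. Could you provide the model information for the CPU and GPU?
2. The CPU is an Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz; the GPU is a K40.
3. The parameters are all initialized automatically by running the startup program with exe.run(inference_startup).
4. My data consists of a single sample; training runs over that same sample again and again.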