Paddle [Paper Reproduction] Per-iteration time increases during multi-GPU training

fv2wmkja, posted 2022-04-21
  • Version / environment info:

   1) PaddlePaddle version: 1.8.2
   2) GPU: the script-task environment on the AI Studio platform (a 4-GPU environment was selected for the test runs)
Project link: https://aistudio.baidu.com/aistudio/clusterprojectdetail/773316 (public)
Symptom: starting from the second epoch, the run time of each iteration doubles. This time includes the reader's data-loading time; further testing shows the GPU compute time is unchanged, and it is mainly the reader's data-loading time that has doubled. I don't know what causes it. The output log below shows the specific behavior:
Loss at epoch 0 step 5: [4.3775268], acc: [0.125],period: 16.51509690284729
Loss at epoch 0 step 10: [3.8490183], acc: [0.40625],period: 16.0886127948761
Loss at epoch 0 step 15: [2.9390435], acc: [0.5625],period: 16.38063144683838
Loss at epoch 0 step 20: [2.3567476], acc: [0.65625],period: 16.37914490699768
Loss at epoch 0 step 25: [1.5900159], acc: [0.734375],period: 16.665042877197266
Loss at epoch 0 step 30: [1.2181798], acc: [0.734375],period: 15.94478726387024
Loss at epoch 0 step 35: [0.6449917], acc: [0.90625],period: 16.192134141921997
valid Loss at step 0: [0.25852293], acc: [1.]
valid Loss at step 20: [0.46329147], acc: [0.875]
valid Loss at step 40: [1.2772802], acc: [0.71875]
valid Loss at step 60: [1.3179752], acc: [0.6875]
valid Loss at step 80: [1.1119821], acc: [0.75]
start data reader (trainers_num: 4, trainer_id: 1)
valid Loss at step 100: [2.2030392], acc: [0.1875]
start data reader (trainers_num: 4, trainer_id: 3)
start data reader (trainers_num: 4, trainer_id: 2)
Validation accuracy: 0.8328919410705566
Validation loss: 0.7212506532669067
Epoch1 lr=0.009999999776482582
start data reader (trainers_num: 4, trainer_id: 0)
Loss at epoch 1 step 5: [0.55662304], acc: [0.890625],period: 36.16345572471619
Loss at epoch 1 step 10: [0.42250103], acc: [0.875],period: 37.54211711883545
Loss at epoch 1 step 15: [0.27849782], acc: [0.96875],period: 37.35439491271973
Loss at epoch 1 step 20: [0.33116293], acc: [0.90625],period: 35.67357039451599
Loss at epoch 1 step 25: [0.37910157], acc: [0.890625],period: 37.441500663757324
Loss at epoch 1 step 30: [0.36237404], acc: [0.953125],period: 37.42715764045715
Loss at epoch 1 step 35: [0.36434418], acc: [0.921875],period: 35.077528953552246
valid Loss at step 0: [0.08099228], acc: [1.]
valid Loss at step 20: [0.43469694], acc: [0.84375]
valid Loss at step 40: [1.1833289], acc: [0.71875]
valid Loss at step 60: [0.94777864], acc: [0.71875]
valid Loss at step 80: [0.42069796], acc: [0.9375]
valid Loss at step 100: [1.0385896], acc: [0.71875]
start data reader (trainers_num: 4, trainer_id: 2)
start data reader (trainers_num: 4, trainer_id: 3)
start data reader (trainers_num: 4, trainer_id: 1)
Validation accuracy: 0.8781779408454895
Validation loss: 0.44851189851760864
Epoch2 lr=0.0010000000474974513
start data reader (trainers_num: 4, trainer_id: 0)
Loss at epoch 2 step 5: [0.38328272], acc: [0.921875],period: 37.521843671798706
Loss at epoch 2 step 10: [0.3836357], acc: [0.90625],period: 36.96732306480408
Loss at epoch 2 step 15: [0.2954501], acc: [0.9375],period: 37.78758931159973
Loss at epoch 2 step 20: [0.28290433], acc: [0.9375],period: 36.79577970504761
Loss at epoch 2 step 25: [0.23077166], acc: [0.984375],period: 35.68168759346008
Loss at epoch 2 step 30: [0.303536], acc: [0.9375],period: 37.21159076690674
Loss at epoch 2 step 35: [0.25395435], acc: [0.9375],period: 36.25998377799988
valid Loss at step 0: [0.07491596], acc: [1.]
valid Loss at step 20: [0.39159745], acc: [0.84375]
valid Loss at step 40: [1.0963948], acc: [0.71875]
valid Loss at step 60: [0.9259555], acc: [0.71875]
valid Loss at step 80: [0.5368143], acc: [0.90625]
valid Loss at step 100: [0.93665516], acc: [0.78125]
Validation accuracy: 0.8916842937469482
Validation loss: 0.41894418001174927
INFO 2020-09-01 00:58:53,138 launch.py:223] Local procs complete, POD info:rank:0 id:None addr:127.0.0.1 port:None visible_gpu:[] trainers:["gpu:['0'] endpoint:127.0.0.1:52136 rank:0", "gpu:['1'] endpoint:127.0.0.1:35241 rank:1", "gpu:['2'] endpoint:127.0.0.1:47208 rank:2", "gpu:['3'] endpoint:127.0.0.1:49438 rank:3"]

period is the measured run time of each iteration (data-reading time included); the test code is as follows:
import time

import numpy as np
import paddle.fluid as fluid

# Runs inside an outer epoch loop (i is the epoch index); train_reader,
# train_model (a fluid.dygraph.DataParallel model), opt and dev_id come
# from the surrounding setup.
cur_time = time.time()  # initialize just before the loop so the first period is meaningful
for batch_id, data in enumerate(train_reader()):
    dy_x_data = np.array([x[0] for x in data]).astype('float32')
    y_data = np.array([[x[1]] for x in data]).astype('int64')

    # period covers everything since the previous measurement: the wait
    # for the reader plus the previous iteration's compute time.
    period = time.time() - cur_time
    cur_time = time.time()

    img = fluid.dygraph.to_variable(dy_x_data)
    label = fluid.dygraph.to_variable(y_data)
    label.stop_gradient = True

    out = train_model(img)
    acc = fluid.layers.accuracy(out, label)
    loss = fluid.layers.cross_entropy(out, label)
    avg_loss = fluid.layers.mean(loss)

    # DataParallel: scale the loss by 1/nranks, then all-reduce the gradients.
    avg_loss = train_model.scale_loss(avg_loss)
    avg_loss.backward()
    train_model.apply_collective_grads()

    opt.minimize(avg_loss)
    train_model.clear_gradients()

    if dev_id == 0:
        if batch_id % 5 == 0 and batch_id > 0:
            # The *4 undoes the 1/4 loss scaling for logging on the 4-GPU run.
            print("Loss at epoch {} step {}: {}, acc: {},period: {}".format(
                i, batch_id, avg_loss.numpy() * 4, acc.numpy(), period))

mxg2im7a1#

Right, I'm working on a change now. Multi-threaded scheduling itself has overhead: waiting for one thread to load a batch and hand it straight to the training program is not the same as waiting for one of several threads to load data, push it into a queue, and then pop it back out for the training program; the latter adds some extra operations.
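
As a rough illustration of that overhead, here is a minimal benchmark sketch of my own (not from the project) comparing direct iteration with handing samples through a queue fed by one producer thread:

import threading
import time
from queue import Queue

def direct(reader):
    # Hand each sample straight to the consumer, no queue involved.
    for sample in reader():
        yield sample

def queued(reader, buffer_size=8):
    # One producer thread pushes into a bounded queue; the consumer pops.
    # The put/get pair plus thread scheduling add a per-sample cost.
    def gen():
        q = Queue(buffer_size)
        end = object()  # sentinel marking end of data

        def feed():
            for sample in reader():
                q.put(sample)
            q.put(end)

        threading.Thread(target=feed, daemon=True).start()
        while True:
            item = q.get()
            if item is end:
                break
            yield item
    return gen

def bench(make_iter):
    start = time.time()
    for _ in make_iter():
        pass
    return time.time() - start

reader = lambda: iter(range(200000))
print("direct:", bench(lambda: direct(reader)))
print("queued:", bench(queued(reader)))

With trivial samples the queue hand-off cost shows up clearly; with heavier per-sample work it is amortized across the longer iterations.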


ztigrdn82#

If it's convenient, could you request some compute quota and help debug this issue? Thanks.


6ie5vjzr3#

I've finished running your modified code. Judging from the output, it looks like the threads' video-decode time is what increased: in epoch 0 most decode times are under 1 s, while in epoch 1 they rise to around 2 s.
decode mp4 sample reader_t = 1.0964207649230957
decode mp4 sample reader_t = 1.7234702110290527
decode mp4 sample reader_t = 1.13865065574646
decode mp4 sample reader_t = 0.48386144638061523
decode mp4 sample reader_t = 0.4739949703216553
Loss at epoch 0 step 3: [2.4785604], acc: [0.546875],period: 4.038020849227905,read_t:2.758361339569092
decode mp4 sample reader_t = 1.0598969459533691
decode mp4 sample reader_t = 1.0353162288665771
decode mp4 sample reader_t = 1.531890630722046
decode mp4 sample reader_t = 1.0284180641174316
decode mp4 sample reader_t = 0.44118332862854004
decode mp4 sample reader_t = 0.5898580551147461
Loss at epoch 0 step 6: [0.5190988], acc: [0.921875],period: 3.845128297805786,read_t:2.744197130203247
decode mp4 sample reader_t = 1.0410845279693604
decode mp4 sample reader_t = 1.1522727012634277
decode mp4 sample reader_t = 0.7044045925140381
decode mp4 sample reader_t = 0.9415566921234131
decode mp4 sample reader_t = 1.6548316478729248
decode mp4 sample reader_t = 1.1805496215820312
decode mp4 sample reader_t = 1.1308026313781738
Loss at epoch 1 step 3: [0.01576277], acc: [0.984375],period: 6.871838808059692,read_t:5.923932790756226
decode mp4 sample reader_t = 1.808323860168457
decode mp4 sample reader_t = 1.1243462562561035
decode mp4 sample reader_t = 1.8299620151519775
decode mp4 sample reader_t = 0.8673379421234131
decode mp4 sample reader_t = 1.2235405445098877
decode mp4 sample reader_t = 2.0187885761260986
decode mp4 sample reader_t = 2.4569594860076904
decode mp4 sample reader_t = 1.2698397636413574
decode mp4 sample reader_t = 0.5672931671142578
Loss at epoch 1 step 6: [0.01758973], acc: [1.],period: 6.872966051101685,read_t:5.782782077789307
decode mp4 sample reader_t = 2.0136091709136963
decode mp4 sample reader_t = 2.0276029109954834
decode mp4 sample reader_t = 2.005891799926758
decode mp4 sample reader_t = 2.111921787261963
decode mp4 sample reader_t = 1.1750879287719727
decode mp4 sample reader_t = 2.0800154209136963


dly7yett4#

This is the log printed by another project of mine. I suspected the issue was tied to the multi-GPU environment, so I switched to a single-GPU environment to test, and sure enough the loop time became stable there. That project had the same problem before; it is written against the static-graph framework.


dzjeubhm5#

How about switching to a single-GPU environment to test first? The multi-GPU environment's printed output is too jumbled to follow.

OK. And this log is...? From a single GPU?


up9lanfz6#

I'm running another program: under the static-graph framework, the phenomenon goes away as soon as I switch to a single-GPU environment, and the loop time is very stable:
Epoch 0 iter 420 : acc: 0.4375 aux_loss:1.3545786142349243 cls_loss: 2.078500509262085 period:2.212111234664917
Epoch 0 iter 440 : acc: 0.625 aux_loss:1.2028199434280396 cls_loss: 1.6616206169128418 period:2.212972640991211
Epoch 0 iter 460 : acc: 0.5 aux_loss:1.0021114349365234 cls_loss: 1.3724995851516724 period:2.2145843505859375
Epoch 0 iter 480 : acc: 0.3125 aux_loss:1.341996431350708 cls_loss: 2.1872904300689697 period:2.2130870819091797
Epoch 0 iter 500 : acc: 0.6875 aux_loss:1.0141962766647339 cls_loss: 1.1664724349975586 period:3.60237979888916
Epoch 0 iter 520 : acc: 0.5625 aux_loss:1.1757774353027344 cls_loss: 2.291337251663208 period:2.2142703533172607
Epoch 0 iter 540 : acc: 0.25 aux_loss:1.2600080966949463 cls_loss: 2.6120495796203613 period:2.212618827819824
Epoch 0 iter 560 : acc: 0.5625 aux_loss:0.9054265022277832 cls_loss: 1.2888394594192505 period:2.2169723510742188
Epoch 0 iter 580 : acc: 0.375 aux_loss:1.5554670095443726 cls_loss: 2.8818106651306152 period:2.2082650661468506
share_vars_from is set, scope is ignored.
Valid iter 20 : acc: 0.5 aux_loss:1.0781669616699219 cls_loss: 1.75814688205719
Valid iter 40 : acc: 0.0625 aux_loss:1.4953408241271973 cls_loss: 5.368460178375244
Valid iter 60 : acc: 0.375 aux_loss:1.4679415225982666 cls_loss: 2.9575371742248535
Valid iter 80 : acc: 0.59375 aux_loss:0.8214657306671143 cls_loss: 1.3301255702972412
Valid iter 100 : acc: 0.28125 aux_loss:1.1683253049850464 cls_loss: 4.290220737457275
Valid: acc: 0.4600105881690979 aux_loss:1.1079814434051514 cls_loss: 2.4592885971069336
------- learning rate [0.01], learning rate counter [-] -----
Epoch 1 iter 20 : acc: 0.5625 aux_loss:0.9905952215194702 cls_loss: 1.5344476699829102 period:2.336097478866577
Epoch 1 iter 40 : acc: 0.75 aux_loss:0.942418098449707 cls_loss: 1.1069670915603638 period:2.3488337993621826
Epoch 1 iter 60 : acc: 0.625 aux_loss:1.1455044746398926 cls_loss: 1.4981002807617188 period:2.9351730346679688


bn31dyow7#

How about switching to a single-GPU environment to test first? The multi-GPU environment's printed output is too jumbled to follow.


8gsdolmq8#

The reader overhead differs between the two cases, so the behavior can differ too: if the reader gets faster, the data supply can keep up better.

Take a look at the experiment results; the time series of data-loading times should show the difference between the two cases.

The xmap_readers decorator itself is very simple; there is no deep logic in it:

Paddle/python/paddle/reader/decorator.py, line 364 in 0a29fc8:

def xmap_readers(mapper, reader, process_num, buffer_size, order=False):

It uses two queues internally. If the input queue runs empty, training has to wait for data to be loaded before it can continue; if the reader's per-sample load time differs, the amount of usable data left in the queue will differ as well.
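
For intuition, the pattern inside xmap_readers is roughly the following. This is a simplified sketch of the two-queue, multi-threaded mapper, not the actual Paddle source (the real decorator also implements the order=True path and its own end-of-data flags):

import threading
from queue import Queue

def xmap_like(mapper, reader, process_num, buffer_size):
    # Worker threads pull raw samples from in_queue, apply mapper
    # (here, the mp4 decode), and push results to out_queue. The
    # training loop consumes out_queue; if it runs dry, training waits.
    def decorated():
        in_queue = Queue(buffer_size)
        out_queue = Queue(buffer_size)
        end = object()  # sentinel marking end of data

        def feed():
            for sample in reader():
                in_queue.put(sample)
            for _ in range(process_num):
                in_queue.put(end)

        def work():
            while True:
                sample = in_queue.get()
                if sample is end:
                    out_queue.put(end)
                    break
                out_queue.put(mapper(sample))

        threading.Thread(target=feed, daemon=True).start()
        for _ in range(process_num):
            threading.Thread(target=work, daemon=True).start()

        finished = 0
        while finished < process_num:
            item = out_queue.get()
            if item is end:
                finished += 1
            else:
                yield item
    return decorated

If mapper slows down, out_queue drains and every batch fetch in the training loop turns into a wait, which from the outside looks exactly like a longer per-iteration time.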


xvw2m8pv9#

I'll make the change myself and give it a run. One thing that doesn't add up, though: after switching the data source to jpg, the abnormal increase disappears. By your explanation, the phenomenon should appear no matter which data source is used.


xxb16uws10#

You internal folks can actually run short of compute quota? Now I've seen everything.

Not really internal; Paddle and AI Studio are two separate departments.


zvokhttg11#

I only added these two changes; let's see what the log sequence looks like once the run finishes.


ghhaqwfi12#

You internal folks can actually run short of compute quota? Now I've seen everything.


fhg3lkii13#

Without forking I have no edit permission; I can only view your code.


zynd9foi14#

I made my changes after forking.


ohtdti5x15#

Finished the changes, then realized I have no compute quota... o(╥﹏╥)o
