Paddle: problem after updating to the newest version!

nbnkbykc · posted 2021-11-30 in Java

When I update Paddle to the newest develop CPU version, I run into this problem:
`
Traceback (most recent call last):
  File "main.py", line 443, in <module>
    main(config)
  File "main.py", line 373, in main
    fleet.init_server()
  File "", line 2, in init_server
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py", line 60, in __impl__
    return func(*args, **kwargs)
  File "", line 2, in init_server
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py", line 44, in __impl__
    return func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py", line 530, in init_server
    self._runtime_handle._init_server(*args, **kwargs)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/runtime/the_one_ps.py", line 784, in _init_server
    server = self._get_fleet_proto(is_server=True, is_sync=is_sync)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/runtime/the_one_ps.py", line 762, in _get_fleet_proto
    tables = _get_tables()
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/runtime/the_one_ps.py", line 722, in _get_tables
    self.compiled_strategy)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/runtime/the_one_ps.py", line 149, in parse_by_optimizer
    optimizer_ops = _get_optimize_ops(main_program)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/incubate/fleet/parameter_server/ir/public.py", line 1142, in _get_optimize_ops
    if _is_opt_role_op(op):
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/incubate/fleet/parameter_server/ir/public.py", line 1133, in _is_opt_role_op
    int(op.all_attrs()[op_maker.kOpRoleAttrName()]) == int(optimize_role):
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2776, in all_attrs
    attr_map[n] = self._block_attr(n)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2731, in _block_attr
    assert (id >= 0 and id < len(self.block.program.blocks))
AssertionError

C++ Traceback (most recent call last):

0 paddle::framework::SignalHandle(char const*, int)
1 paddle::platform::GetCurrentTraceBackString

Error Message Summary:

FatalError: Termination signal is detected by the operating system.
[TimeInfo:Aborted at 1622627345 (unix time) try "date -d @1622627345" if you are using GNU date]
[SignalInfo:SIGTERM (@0x1351) received by PID 4960 (TID 0x7f441c37e700) from PID 4945]
`

It tells me that init_server failed.
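The assertion fires inside `Operator._block_attr`, i.e. some op in the program carries a block attribute (such as `sub_block`) whose block index is out of range. As a rough debugging aid (a minimal sketch using the standard static-graph `Program`/`Operator` inspection APIs, not a confirmed diagnosis), one can list the ops that carry a `sub_block` attribute in the program built by the script below:

`
import paddle

paddle.enable_static()

# Inspection sketch: list every op that carries a `sub_block` block attribute,
# because Operator._block_attr asserts when such a block index is out of range.
# Run this against the program that main() below has already built.
prog = paddle.static.default_main_program()
for block in prog.blocks:
    for op in block.ops:
        if "sub_block" in op.attr_names:
            print("block", block.idx, "op", op.type, "attr_names", op.attr_names)
`

For reference, the main() in question is: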

`
def main(args):

    set_seed(args.seed)

    server_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
    cpu_num = int(os.getenv("CPU_NUM", 12))

    log.info(args)

    place = paddle.CPUPlace()
    exe = paddle.static.Executor(place)

    role = role_maker.PaddleCloudRoleMaker()
    fleet.init(role)

    log.info("Load Dataset...")
    data = build_datasets(args)
    train_ds = ShardedDataset(data.train_index, 'train', args.repeat)
    valid_ds = ShardedDataset(data.valid_index, 'valid')

    train_collate_fn = BatchRandWalk(data.graph, args, 'train')
    valid_collate_fn = BatchRandWalk(data.graph, args, 'valid')

    train_loader = pgl.utils.data.Dataloader(train_ds,
                                             batch_size=args.batch_size // cpu_num,
                                             shuffle=True,
                                             num_workers=args.sample_workers,
                                             collate_fn=train_collate_fn)

    valid_loader = pgl.utils.data.Dataloader(valid_ds,
                                             batch_size=args.batch_size,
                                             shuffle=False,
                                             num_workers=args.sample_workers,
                                             collate_fn=valid_collate_fn)

    log.info("Load Model...")
    model = StaticGatneModel(config, data.graph)
    decay_steps = math.ceil(data.graph.num_nodes * args.decay_epochs / args.batch_size /
                            cpu_num / server_num)

    test_program = paddle.static.default_main_program().clone(for_test=True)

    log.info("Init Optimization...")
    optimization(model.loss, decay_steps, args)

    log.info("Init and Run Server or Worker...")
    if fleet.is_server():
        fleet.init_server()
        fleet.run_server()

    if fleet.is_worker():
        exe.run(paddle.static.default_startup_program())

        fleet.init_worker()

        main_program = paddle.static.default_main_program()

        compiled_train_prog = build_complied_prog(main_program, model.loss, cpu_num)
        compiled_valid_prog = build_complied_prog(test_program, model.loss, 1)
        # valid before train
        top_f1 = 0
        log.info("Valid Before Train...")
        valid_prog(valid_loader, exe, test_program, model, args, data)
        for epoch in range(args.epochs):
            train_loss = train_prog(train_loader, exe, compiled_train_prog, model, args)
            log.info("epoch %s total train loss %s " % (epoch, train_loss))

            valid_result = valid_prog(valid_loader, exe, compiled_valid_prog, model, args, data)
            if valid_result['F1'] > top_f1:
                top_f1 = valid_result['F1']
                paddle.static.save(compiled_valid_prog, args.save_path)
                log.info("save checkpoints finished!!! %s 🚄🚄🚄🚄🚄" % args.save_path)

    fleet.stop_worker()

`
The second problem is that training gets stuck when a_sync = False.
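For context, a_sync is the fleet DistributedStrategy switch between synchronous and asynchronous parameter-server training. A minimal sketch of how it is usually wired up is below; the optimizer and its learning rate are placeholders, not the optimization() helper used in the script above:

`
import paddle
import paddle.distributed.fleet as fleet

paddle.enable_static()
fleet.init(role_maker=fleet.PaddleCloudRoleMaker())

# a_sync=False requests synchronous parameter-server training;
# a_sync=True requests the asynchronous mode discussed in the answers below.
strategy = fleet.DistributedStrategy()
strategy.a_sync = False

optimizer = paddle.optimizer.Adam(learning_rate=0.001)   # placeholder optimizer
optimizer = fleet.distributed_optimizer(optimizer, strategy)
# optimizer.minimize(model.loss) would then be called on the model's loss.
`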


Thank you for contributing to PaddlePaddle.
Before submitting the issue, you could search GitHub issues in case a similar issue was submitted or resolved before.
If there is no solution, please make sure this is an installation issue and include the following details:

System information

- PaddlePaddle version (e.g. 1.1) or CommitID
- CPU: including MKL/OpenBlas/MKLDNN version
- GPU: including CUDA/CUDNN version
- OS Platform (e.g. Mac OS 10.14)
- Python version

  • Install method: pip install / install with docker / build from source (without docker) / build within docker
  • Other special cases that you think may be related to this problem, e.g. offline install, special internet condition

Note: You can get most of the information by running summary_env.py.  

To Reproduce

Steps to reproduce the behavior

Describe your current behavior
Code to reproduce the issue
Other info / logs

lnlaulya 1#

Hi, which PGL model are you using? What are the PGL version and the Paddle version? Can you reproduce the problem with the official PGL examples?

plupiseo 2#

Hi, both are the latest versions (Paddle develop installed yesterday, PGL==2.1.4). When I use collective mode there is no problem, but in PS mode training hangs inexplicably when a_sync = False. The first problem turned out to be caused by the learning rate; removing these lines fixes it:

    # decayed_lr = paddle.fluid.layers.learning_rate_scheduler.polynomial_decay(
    #     learning_rate=args.lr,
    #     decay_steps=decay_steps,
    #     end_learning_rate=args.lr,
    #     power=1.0,
    #     cycle=True)

Finally, convergence in PS mode (a_sync = True) is now clearly slower than collective.
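An aside, not something confirmed in this thread: if a decaying learning rate is still wanted without the fluid scheduler, one option to try is the 2.x `paddle.optimizer.lr.PolynomialDecay` scheduler, which drives the decay from the Python side via `scheduler.step()`. A minimal sketch with placeholder values standing in for args.lr and the decay_steps computed in main():

`
import paddle

paddle.enable_static()

# Sketch only: values below are placeholders; this mirrors the commented-out
# fluid polynomial_decay call from the previous reply.
scheduler = paddle.optimizer.lr.PolynomialDecay(
    learning_rate=0.025,   # args.lr
    decay_steps=1000,      # decay_steps
    end_lr=0.0001,
    power=1.0,
    cycle=True)

optimizer = paddle.optimizer.Adam(learning_rate=scheduler)
# In static mode, call scheduler.step() once per executor step so the
# learning-rate value actually decays.
`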

63lcw9qa 3#

  1. When you say a_sync=False hangs, does it hang before training even starts, or does it hang unexpectedly partway through?
  2. If the results are worse than collective, the hyperparameters may need tuning; a_sync=True is the asynchronous mode, and its early convergence will be slower than collective synchronous training.
0yg35tkg 4#

Hi, validation itself is fine; the hang happens before training even starts, although "valid before train" does run. The results are far worse than collective! I have tuned both lr and batch size, with no obvious improvement…

