When I updated Paddle to the newest develop CPU build, I hit the following problem:
`
Traceback (most recent call last):
  File "main.py", line 443, in <module>
    main(config)
  File "main.py", line 373, in main
    fleet.init_server()
  File "", line 2, in init_server
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py", line 60, in __impl__
    return func(*args, **kwargs)
  File "", line 2, in init_server
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py", line 44, in __impl__
    return func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py", line 530, in init_server
    self._runtime_handle._init_server(*args, **kwargs)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/runtime/the_one_ps.py", line 784, in _init_server
    server = self._get_fleet_proto(is_server=True, is_sync=is_sync)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/runtime/the_one_ps.py", line 762, in _get_fleet_proto
    tables = _get_tables()
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/runtime/the_one_ps.py", line 722, in _get_tables
    self.compiled_strategy)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/distributed/fleet/runtime/the_one_ps.py", line 149, in parse_by_optimizer
    optimizer_ops = _get_optimize_ops(main_program)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/incubate/fleet/parameter_server/ir/public.py", line 1142, in _get_optimize_ops
    if _is_opt_role_op(op):
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/incubate/fleet/parameter_server/ir/public.py", line 1133, in _is_opt_role_op
    int(op.all_attrs()[op_maker.kOpRoleAttrName()]) == int(optimize_role):
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2776, in all_attrs
    attr_map[n] = self._block_attr(n)
  File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2731, in _block_attr
    assert (id >= 0 and id < len(self.block.program.blocks))
AssertionError

C++ Traceback (most recent call last):
0   paddle::framework::SignalHandle(char const*, int)
1   paddle::platform::GetCurrentTraceBackString

Error Message Summary:
FatalError: Termination signal is detected by the operating system.
  [TimeInfo: Aborted at 1622627345 (unix time) try "date -d @1622627345" if you are using GNU date]
  [SignalInfo: SIGTERM (@0x1351) received by PID 4960 (TID 0x7f441c37e700) from PID 4945]
`
It tells me that init_server failed.
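The assertion fires in Operator._block_attr, which suggests some op in the program carries a block-typed attribute whose index no longer points to a valid block of the program being split for the parameter server. For reference, a minimal inspection sketch (assuming only public static-graph APIs, not part of my original script) that lists which ops carry block-like attributes:

`
import paddle

paddle.enable_static()

# after the program has been built, list every op that references a sub-block
prog = paddle.static.default_main_program()
for block_idx, block in enumerate(prog.blocks):
    for op in block.ops:
        suspicious = [a for a in op.attr_names if "block" in a.lower()]
        if suspicious:
            print(block_idx, op.type, suspicious)
`

The main() that triggers the failure is below: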
`
def main(args):
    set_seed(args.seed)
    server_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
    cpu_num = int(os.getenv("CPU_NUM", 12))
    log.info(args)

    place = paddle.CPUPlace()
    exe = paddle.static.Executor(place)

    role = role_maker.PaddleCloudRoleMaker()
    fleet.init(role)

    log.info("Load Dataset...")
    data = build_datasets(args)
    train_ds = ShardedDataset(data.train_index, 'train', args.repeat)
    valid_ds = ShardedDataset(data.valid_index, 'valid')
    train_collate_fn = BatchRandWalk(data.graph, args, 'train')
    valid_collate_fn = BatchRandWalk(data.graph, args, 'valid')
    train_loader = pgl.utils.data.Dataloader(train_ds,
                                             batch_size=args.batch_size // cpu_num,
                                             shuffle=True,
                                             num_workers=args.sample_workers,
                                             collate_fn=train_collate_fn)
    valid_loader = pgl.utils.data.Dataloader(valid_ds,
                                             batch_size=args.batch_size,
                                             shuffle=False,
                                             num_workers=args.sample_workers,
                                             collate_fn=valid_collate_fn)

    log.info("Load Model...")
    model = StaticGatneModel(config, data.graph)
    decay_steps = math.ceil(data.graph.num_nodes * args.decay_epochs / args.batch_size /
                            cpu_num / server_num)
    test_program = paddle.static.default_main_program().clone(for_test=True)

    log.info("Init Optimization...")
    optimization(model.loss, decay_steps, args)

    log.info("Init and Run Server or Worker...")
    if fleet.is_server():
        fleet.init_server()
        fleet.run_server()

    if fleet.is_worker():
        exe.run(paddle.static.default_startup_program())
        fleet.init_worker()

        main_program = paddle.static.default_main_program()
        compiled_train_prog = build_complied_prog(main_program, model.loss, cpu_num)
        compiled_valid_prog = build_complied_prog(test_program, model.loss, 1)

        # valid before train
        top_f1 = 0
        log.info("Valid Before Train...")
        valid_prog(valid_loader, exe, test_program, model, args, data)

        for epoch in range(args.epochs):
            train_loss = train_prog(train_loader, exe, compiled_train_prog, model, args)
            log.info("epoch %s total train loss %s " % (epoch, train_loss))
            valid_result = valid_prog(valid_loader, exe, compiled_valid_prog, model, args, data)
            if valid_result['F1'] > top_f1:
                top_f1 = valid_result['F1']
                paddle.static.save(compiled_valid_prog, args.save_path)
                log.info("save checkpoints finished!!! %s 🚄🚄🚄🚄🚄" % args.save_path)

        fleet.stop_worker()
`
The second problem: training gets stuck when a_sync = False.
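For context, a_sync here is the flag on fleet's DistributedStrategy. The actual `optimization(...)` helper is not shown above, so the optimizer below is a placeholder; this is only a rough sketch of how the flag is normally set:

`
import paddle
import paddle.distributed.fleet as fleet

paddle.enable_static()
fleet.init(is_collective=False)  # parameter-server mode

strategy = fleet.DistributedStrategy()
strategy.a_sync = False  # False = synchronous PS training, True = asynchronous

optimizer = paddle.optimizer.Adam(learning_rate=0.001)  # placeholder optimizer
optimizer = fleet.distributed_optimizer(optimizer, strategy)
# optimizer.minimize(model.loss) follows in the real script
`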
4 answers

lnlaulya 1#

Hi, which PGL model are you using? Which PGL version and which Paddle version are these? Can the problem be reproduced with the official PGL examples?
plupiseo 2#

Hi, both are the latest versions (paddle develop installed yesterday, PGL==2.1.4). In collective mode there is no problem, but in PS mode training inexplicably hangs when a_sync = False!

The first problem turned out to be caused by the learning rate schedule; removing these lines fixes it:

`
# decayed_lr = paddle.fluid.layers.learning_rate_scheduler.polynomial_decay(
#     learning_rate=args.lr,
#     decay_steps=decay_steps,
#     end_learning_rate=args.lr,
#     power=1.0,
#     cycle=True)
`

Finally, convergence in PS mode (a_sync = True) is now noticeably slower than in collective mode.
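As a hedged note (not from this thread): the old fluid polynomial_decay with cycle=True builds a Switch/conditional sub-block inside the program, which appears to be what the PS program splitter trips over. The Paddle 2.x scheduler computes the decayed LR in Python instead of through sub-blocks, so a sketch like the following may avoid the assertion; whether it is fully supported on the PS path would still need verification. The numeric values are placeholders for args.lr and the computed decay_steps:

`
import paddle

scheduler = paddle.optimizer.lr.PolynomialDecay(
    learning_rate=0.001,   # placeholder for args.lr
    decay_steps=10000,     # placeholder for the computed decay_steps
    end_lr=0.0001,
    power=1.0,
    cycle=True)
optimizer = paddle.optimizer.Adam(learning_rate=scheduler)
# wrap with fleet.distributed_optimizer(optimizer, strategy) as usual,
# and call scheduler.step() after each executor run in static mode
`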
63lcw9qa 3#

Does "stuck when a_sync=False" mean it hangs before training even starts, or that it hangs somewhere in the middle? As for the results being worse than collective, the hyper-parameters may need tuning: a_sync=True is the asynchronous mode, so early convergence will be slower than collective synchronous training.
0yg35tkg 4#

Hi, valid is fine; the hang happens before training even starts, though the "valid before train" step does run. The results are much worse than collective! I have tuned both lr and batch size, without much effect...