尝试在HDP3.1集群上运行h2o并遇到错误,这似乎与Yarn资源容量有关。。。
[ml1user@HW04 h2o-3.26.0.1-hdp3.1]$ hadoop jar h2odriver.jar -nodes 3 -mapperXmx 10g
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 192.168.122.1]
[Possible callback IP address: 172.18.4.49]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46015
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
mapreduce.map.java.opts: -Xms10g -Xmx10g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 11264
Hive driver not present, not generating token.
19/07/25 14:48:05 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:48:06 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/07/25 14:48:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1user/.staging/job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: number of splits:3
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/07/25 14:48:08 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/07/25 14:48:08 INFO impl.YarnClientImpl: Submitted application application_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1564020515809_0006/
Job name 'H2O_47159' submitted
JobTracker job ID is 'job_1564020515809_0006'
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Waiting for H2O cluster to come up...
ERROR: Timed out waiting for H2O cluster to come up (120 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
19/07/25 14:50:19 INFO impl.YarnClientImpl: Killed application application_1564020515809_0006
Killed.
19/07/25 14:50:23 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:50:23 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
----- YARN cluster metrics -----
Number of YARN worker nodes: 3
----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
----- Queues -----
Queue name: default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 1.00
Maximum capacity: 1.00
Application count: 0
Queue 'default' approximate utilization: 0.0 / 45.0 GB used, 0 / 9 vcores used
----------------------------------------------------------------------
ERROR: Unable to start any H2O nodes; please contact your YARN administrator.
A common cause for this is the requested container size (11.0 GB)
exceeds the following YARN settings:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-mb
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
查看ambari ui中的Yarn配置,这些属性不见踪影。但是,在yarn资源管理器ui中检查yarn日志,并检查被杀死的应用程序的一些日志,我看到了似乎无法访问的主机错误。。。
Container: container_e05_1564020515809_0006_02_000002 on HW03.ucera.local_45454_1564102219781
LogAggregationType: AGGREGATED
=============================================================================================
LogType:stderr
LogLastModifiedTime:Thu Jul 25 14:50:19 -1000 2019
LogLength:2203
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/11/mapreduce.tar.gz/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/usercache/ml1user/appcache/application_1564020515809_0006/filecache/10/job.jar/job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.YarnChild).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.net.NoRouteToHostException: No route to host (Host unreachable)
at java.net.PlainSocketImpl.socketConnect(Native Method)
....
at java.net.Socket.<init>(Socket.java:211)
at water.hadoop.EmbeddedH2OConfig$BackgroundWriterThread.run(EmbeddedH2OConfig.java:38)
End of LogType:stderr
***********************************************************************
注意“java.net.noroutetohestexception:no route to host(host unreachable)”。但是,我可以从彼此访问所有其他节点,它们都可以相互ping,所以不确定这里发生了什么。对调试或修复有什么建议吗?
2条答案
按热度按时间ttvkxqim1#
我想我发现了这个问题,tldr:firewalld(在centos7上运行的节点)仍然在运行,在hdp集群上应该禁用它。
从另一个社区帖子:
为了让ambari在安装过程中与它部署和管理的主机通信,某些端口必须是开放的和可用的。最简单的方法是临时禁用iptables,如下所示:
systemctl disable firewalld
service firewalld stop
显然如此iptables
以及firewalld
需要在集群中禁用(这里可以找到支持文档,我只在ambari安装节点上禁用了它们)。在集群中停止这些服务(我建议使用clush)之后,您就能够毫无意外地运行yarn作业了。pkln4tw62#
通常,此问题可能是由于dns配置错误、防火墙或网络不可访问造成的。引用这份官方文件:
配置文件中远程计算机的主机名错误
客户端的主机表/etc/hosts的目标主机IP地址无效。
dns服务器的主机表的目标主机IP地址无效。
客户端的路由表(在linux中,iptables)是错误的。
dhcp服务器正在发布错误的路由信息。
客户机和服务器位于不同的子网上,不设置为相互通信。这可能是意外,也可能是故意锁定hadoop集群。
这些机器正在尝试使用ipv6进行通信。hadoop目前不支持ipv6
主机的ip地址已更改,但长寿命的jvm正在缓存旧值。这是JVM的一个已知问题(有关详细信息和解决方案,请搜索“java负dns缓存”)。快速解决方案:重启JVM
对我来说,问题是驱动程序在docker容器中,这使得工人无法将数据发送回容器。换句话说,工人和驱动程序不在同一子网中。此答案中给出的解决方案是设置以下配置: