I am trying to run a Giraph-based application on a Hadoop cluster. The command I use is yarn jar solver-1.0-SNAPSHOT.jar edu.agh.iga.adi.giraph.IgaSolverTool
First, I had to copy the jar into one of the directories reported by yarn classpath. To be safe, I also changed its file permissions to 777.
I obviously need to ship the jar to the workers as well, so I call conf.setYarnLibJars(currentJar()); where currentJar() is:
private static String currentJar() {
  // Resolve the name of the jar file this class was loaded from
  return new File(IgaGiraphJobFactory.class.getProtectionDomain()
      .getCodeSource()
      .getLocation()
      .getPath()).getName();
}
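In effect, currentJar() takes only the base name of the code-source path, which is what ends up in the YARN lib-jars setting. As a quick sanity check, the same operation on the shell side (the path here is just an illustrative placeholder):

```shell
# currentJar() boils down to taking the base name of the code-source path;
# basename does the same thing on the shell side.
basename "/path/to/target/solver-1.0-SNAPSHOT.jar"
# prints: solver-1.0-SNAPSHOT.jar
```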
This uses just the jar name, which seems to be correct, because the application no longer crashes quickly (as it does with any other value). Instead, it now takes about 10 minutes before a failure is reported. The logs contain this error:
LogType:gam-stderr.log
LogLastModifiedTime:Sat Sep 14 13:24:52 +0000 2019
LogLength:2122
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/kbhit/appcache/application_1568451681492_0016/filecache/11/solver-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "pool-6-thread-2" java.lang.IllegalStateException: Could not configure the containerlaunch context for GiraphYarnTasks.
at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:391)
at org.apache.giraph.yarn.GiraphApplicationMaster.access$500(GiraphApplicationMaster.java:78)
at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.buildContainerLaunchContext(GiraphApplicationMaster.java:522)
at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.run(GiraphApplicationMaster.java:479)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://iga-adi-m/user/yarn/giraph_yarn_jar_cache/application_1568451681492_0016/solver-1.0-SNAPSHOT.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1533)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1526)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1541)
at org.apache.giraph.yarn.YarnUtils.addFileToResourceMap(YarnUtils.java:153)
at org.apache.giraph.yarn.YarnUtils.addFsResourcesToMap(YarnUtils.java:77)
at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:387)
... 6 more
End of LogType:gam-stderr.log.This log file belongs to a running container (container_1568451681492_0016_01_000001) and so may not be complete.
This in turn causes a class-not-found error (GiraphYarnTask) in the worker containers.
It looks as if, for some reason, the jar is not transferred to HDFS along with the config, but that turns out not to be the case. What could the reason be?
Moreover, the jar does appear to be uploaded:
1492_0021/solver-1.0-SNAPSHOT.jar, packetSize=65016, chunksPerPacket=126, bytesCurBlock=73672704
2019-09-14 14:08:26,252 DEBUG [DFSOutputStream] - enqueue full packet seqno: 1142 offsetInBlock: 73672704 lastPacketInBlock: false lastByteOffsetInBlock: 73737216, src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, bytesCurBlock=73737216, blockSize=134217728, appendChunk=false, blk_1073741905_1081@[DatanodeInfoWithStorage[10.164.0.6:9866,DS-2d8f815f-1e64-4a7f-bbf6-0c91ebc613d7,DISK], DatanodeInfoWithStorage[10.164.0.7:9866,DS-6a606f45-ffb7-449f-ab8b-57d5950d5172,DISK]]
2019-09-14 14:08:26,252 DEBUG [DataStreamer] - Queued packet 1142
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1142 offsetInBlock: 73672704 lastPacketInBlock: false lastByteOffsetInBlock: 73737216
2019-09-14 14:08:26,253 DEBUG [DFSClient] - computePacketChunkSize: src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, chunkSize=516, chunksPerPacket=126, packetSize=65016
2019-09-14 14:08:26,253 DEBUG [DFSClient] - DFSClient writeChunk allocating new packet seqno=1143, src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, packetSize=65016, chunksPerPacket=126, bytesCurBlock=73737216
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DFSClient seqno: 1141 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 323347 flag: 0 flag: 0
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DFSClient seqno: 1142 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 326916 flag: 0 flag: 0
2019-09-14 14:08:26,254 DEBUG [DataStreamer] - Queued packet 1143
2019-09-14 14:08:26,256 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1143 offsetInBlock: 73737216 lastPacketInBlock: false lastByteOffsetInBlock: 73771432
2019-09-14 14:08:26,256 DEBUG [DataStreamer] - Queued packet 1144
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - Waiting for ack for: 1144
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - DFSClient seqno: 1143 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 497613 flag: 0 flag: 0
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1144 offsetInBlock: 73771432 lastPacketInBlock: true lastByteOffsetInBlock: 73771432
2019-09-14 14:08:26,263 DEBUG [DataStreamer] - DFSClient seqno: 1144 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 2406978 flag: 0 flag: 0
2019-09-14 14:08:26,263 DEBUG [DataStreamer] - Closing old block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081
2019-09-14 14:08:26,264 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit sending #12 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete
2019-09-14 14:08:26,266 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit got value #12
2019-09-14 14:08:26,267 DEBUG [ProtobufRpcEngine] - Call: complete took 4ms
2019-09-14 14:08:26,267 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit sending #13 org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo
2019-09-14 14:08:26,268 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit got value #13
2019-09-14 14:08:26,268 DEBUG [ProtobufRpcEngine] - Call: getFileInfo took 1ms
2019-09-14 14:08:26,269 INFO [YarnUtils] - Registered file in LocalResources :: hdfs://iga-adi-m/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar
But when I check that directory, the jar is not there:
2019-09-14 14:16:42,795 DEBUG [ProtobufRpcEngine] - Call: getListing took 6ms
Found 1 items
-rw-r--r-- 2 yarn hadoop 187800 2019-09-14 14:08 hdfs://iga-adi-m/user/yarn/giraph_yarn_jar_cache/application_1568451681492_0021/giraph-conf.xml
Meanwhile, if I just manually copy the jar into that directory (predicting its name), everything works as expected. What is going on?
I suspect this might be related to GIRAPH-859.
2 Answers

dluptydi #1
It seems that even though the Giraph maintainers claim it works in YARN mode, that is not really true. There are many bugs which are very hard to work around unless you know the root cause, as in this case.
The cause here is that when Giraph sends the jar to HDFS, it uploads it to one location but downloads it from another, so the workers cannot find the file. This happens if the application is started as a user other than yarn, which is probably a fairly common setup.
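The mismatch is visible directly in the logs above: the upload goes under the submitting user's home directory, while the application master looks under the yarn user's. A sketch of the two paths, with user names and application id taken from the logs:

```shell
SUBMIT_USER="kbhit"   # user who submits the job; the jar is uploaded here
WORKER_USER="yarn"    # user the containers run as; the jar is looked up here
APP="application_1568451681492_0016"
JAR="solver-1.0-SNAPSHOT.jar"

echo "uploaded to:  /user/${SUBMIT_USER}/giraph_yarn_jar_cache/${APP}/${JAR}"
echo "looked up at: /user/${WORKER_USER}/giraph_yarn_jar_cache/${APP}/${JAR}"
```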
There are three workarounds, none of them ideal (and some may not apply):

1. Simply run the application as the yarn user.
2. Manually upload the jar before each computation. Note that you have to upload it to a fresh directory each time (just increment the application number), and you have to create that directory first.
3. Apply this patch and build against that version of Giraph.

I tested all three and everything works.
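A sketch of workaround 2, with the next application id guessed by incrementing the previous one. The id and the name node host are assumptions based on the logs above, and the actual hdfs commands are left as comments so the snippet stands alone:

```shell
# Hypothetical sketch: pre-upload the jar to the path the workers will look
# under (the yarn user's cache), predicting the next application id.
APP_ID="application_1568451681492_0022"   # assumed: previous id incremented
JAR="solver-1.0-SNAPSHOT.jar"
TARGET="/user/yarn/giraph_yarn_jar_cache/${APP_ID}"

echo "expected worker-side path: hdfs://iga-adi-m${TARGET}/${JAR}"
# The directory must be created first, then the jar copied in:
#   hdfs dfs -mkdir -p "$TARGET"
#   hdfs dfs -put "$JAR" "$TARGET/"
```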
ctehm74n #2
I hit a similar error, but in my case the cause was that I was using an AggregatorWriter and had to delete the writer's output files from the previous run. There was also a "file already exists" error in another container, but this was the problem I noticed first. Perhaps this information helps someone else.