My MR job ended with map 100% reduce 35%, and I got many error messages like: running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container.
My input *.bz2 file is about 4 GB; decompressed, it would be around 38 GB. The job took about an hour to run with one master and two slave nodes on Amazon EMR.
My questions are:
- Why does this job use so much memory?
- Why does this job take about an hour? Usually, a 40 GB WordCount job on a small 4-node cluster takes about 10 minutes.
- How should I tune the MR parameters to solve this problem?
- Which Amazon EC2 instance types are the best fit for this problem?
See the following counters from the log:
- Physical memory (bytes) snapshot = 43327889408 => 43.3 GB
- Virtual memory (bytes) snapshot = 108950675456 => 108.95 GB
- Total committed heap usage (bytes) = 34940649472 => 34.94 GB
My proposed solutions are below, but I'm not sure whether they are correct:
- Use larger Amazon EC2 instances with at least 8 GB of memory.
- Tune the MR parameters with the following code:
Version 1:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "jobtest1");
// don't kill the container if physical memory exceeds "mapreduce.map.memory.mb" or "mapreduce.reduce.memory.mb"
conf.setBoolean("yarn.nodemanager.pmem-check-enabled", false);
conf.setBoolean("yarn.nodemanager.vmem-check-enabled", false);
Version 2:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "jobtest2");
//conf.set("mapreduce.input.fileinputformat.split.minsize","3073741824");
conf.set("mapreduce.map.memory.mb", "8192");
conf.set("mapreduce.map.java.opts", "-Xmx6144m");
conf.set("mapreduce.reduce.memory.mb", "8192");
conf.set("mapreduce.reduce.java.opts", "-Xmx6144m");
Log:
15/11/08 11:37:27 INFO mapreduce.Job: map 100% reduce 35%
15/11/08 11:37:27 INFO mapreduce.Job: Task Id : attempt_1446749367313_0006_r_000006_2, Status : FAILED
Container [pid=24745,containerID=container_1446749367313_0006_01_003145] is running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container.
Dump of the process-tree for container_1446749367313_0006_01_003145 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 24745 24743 24745 24745 (bash) 0 0 9658368 291 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2304m -Djava.io.tmpdir=/mnt1/yarn/usercache/ec2-user/appcache/application_1446749367313_0006/container_1446749367313_0006_01_003145/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild**.***.***.***32846 attempt_1446749367313_0006_r_000006_2 3145 1>/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145/stdout 2>/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145/stderr
|- 24749 24745 24745 24745 (java) 14124 1281 3910426624 789477 /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2304m -Djava.io.tmpdir=/mnt1/yarn/usercache/ec2-user/appcache/application_1446749367313_0006/container_1446749367313_0006_01_003145/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild**.***.***.***32846 attempt_1446749367313_0006_r_000006_2 3145
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
15/11/08 11:37:28 INFO mapreduce.Job: map 100% reduce 25%
15/11/08 11:37:30 INFO mapreduce.Job: map 100% reduce 26%
15/11/08 11:37:37 INFO mapreduce.Job: map 100% reduce 27%
15/11/08 11:37:42 INFO mapreduce.Job: map 100% reduce 28%
15/11/08 11:37:53 INFO mapreduce.Job: map 100% reduce 29%
15/11/08 11:37:57 INFO mapreduce.Job: map 100% reduce 34%
15/11/08 11:38:02 INFO mapreduce.Job: map 100% reduce 35%
15/11/08 11:38:13 INFO mapreduce.Job: map 100% reduce 36%
15/11/08 11:38:22 INFO mapreduce.Job: map 100% reduce 37%
15/11/08 11:38:35 INFO mapreduce.Job: map 100% reduce 42%
15/11/08 11:38:36 INFO mapreduce.Job: map 100% reduce 100%
15/11/08 11:38:36 INFO mapreduce.Job: Job job_1446749367313_0006 failed with state FAILED due to: Task failed task_1446749367313_0006_r_000001
Job failed as tasks failed. failedMaps:0 failedReduces:1
15/11/08 11:38:36 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=11806418671
FILE: Number of bytes written=22240791936
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=16874
HDFS: Number of bytes written=0
HDFS: Number of read operations=59
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
S3: Number of bytes read=3942336319
S3: Number of bytes written=0
S3: Number of read operations=0
S3: Number of large read operations=0
S3: Number of write operations=0
Job Counters
Failed reduce tasks=22
Killed reduce tasks=5
Launched map tasks=59
Launched reduce tasks=27
Data-local map tasks=59
Total time spent by all maps in occupied slots (ms)=114327828
Total time spent by all reduces in occupied slots (ms)=131855700
Total time spent by all map tasks (ms)=19054638
Total time spent by all reduce tasks (ms)=10987975
Total vcore-seconds taken by all map tasks=19054638
Total vcore-seconds taken by all reduce tasks=10987975
Total megabyte-seconds taken by all map tasks=27438678720
Total megabyte-seconds taken by all reduce tasks=31645368000
Map-Reduce Framework
Map input records=728795619
Map output records=728795618
Map output bytes=50859151614
Map output materialized bytes=10506705085
Input split bytes=16874
Combine input records=0
Spilled Records=1457591236
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=150143
CPU time spent (ms)=14360870
Physical memory (bytes) snapshot=43327889408
Virtual memory (bytes) snapshot=108950675456
Total committed heap usage (bytes)=34940649472
File Input Format Counters
Bytes Read=0
2 Answers
9gm1akwq1#
The number of reducers should be driven by the size of your input file. The rule of thumb is one reducer per 1 GB of data, unless you compress the mapper output. So in this case the ideal number should be at least 38. Try passing the command-line option -D mapred.reduce.tasks=40 and see whether anything changes.
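For reference, a minimal sketch of making the same setting in the driver code (the job name "wordcount" is just a placeholder; mapred.reduce.tasks is the old property name, and newer releases call it mapreduce.job.reduces):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "wordcount");
// Equivalent to passing -D mapred.reduce.tasks=40 on the command line
job.setNumReduceTasks(40);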
jvidinwx2#
I'm not sure about Amazon EMR. Regarding MapReduce, the key points to consider are:
bzip2 is slower, although it compresses better than gzip. bzip2's decompression speed is faster than its compression speed, but it is still slower than the other formats. So, at a high level, you already start out slower than the 40 GB WordCount that ran in only 10 minutes (assuming that 40 GB input was uncompressed). The next question is: how much slower?
However, your job is still failing after an hour; please confirm that. It only makes sense to work on performance once the job runs successfully, so let's think about why it is failing. You were getting memory errors, and based on the error, containers are failing during the reducer phase (the mapper phase completed 100%). Most likely not even one reducer succeeded. Even though the 32% might trick you into thinking some reducers ran, that percentage could come from preparation work before the first reducer runs. One way to confirm is to check whether any reducer output files were generated, as in the sketch below.
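A minimal sketch of that check using the Hadoop FileSystem API (the output path /user/ec2-user/output is a hypothetical placeholder; substitute your job's output directory):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Any part-r-* files in the output directory mean at least one reducer finished.
FileStatus[] parts = fs.globStatus(new Path("/user/ec2-user/output/part-r-*"));
int completed = (parts == null) ? 0 : parts.length;
System.out.println("Completed reducer output files: " + completed);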
Once you confirm that no reducer ran, you can increase the container memory as in your Version 2.
Your Version 1 will help you see whether only a specific container is causing the problem, and it lets the job complete.