I have a Spark job that reads in a few TB of data and runs two window functions on it. The job runs fine in smaller chunks (50k shuffle partitions over 4 TB), but when I increase the input to 150k-200k shuffle partitions over 15 TB, it starts to fail.
It fails in two ways:
Executors timing out while shuffling

Executor log:
20/07/01 15:58:14 ERROR YarnClusterScheduler: Lost executor 92 on ip-10-102-125-133.ec2.internal: Container killed by YARN for exceeding memory limits. 22.0 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
I have increased the driver size to accommodate the large shuffle:

spark.driver.memory = 16g
spark.driver.maxResultSize = 8g
The executors are r5.xlarge instances with the following configuration:

spark.executor.cores = 4
spark.executor.memory = 18971M
spark.yarn.executor.memoryOverheadFactor = 0.1875

This is well under the maximum allowed by AWS (yarn.nodemanager.resource.memory-mb = 24576): https://docs.aws.amazon.com/emr/latest/releaseguide/emr-hadoop-task-config.html#emr-hadoop-task-config-r5
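For reference, the container size that YARN enforces is spark.executor.memory plus the overhead, which with these settings works out to exactly the 22 GB limit in the error above:

18971 MiB * (1 + 0.1875) ≈ 22528 MiB ≈ 22 GB    (the "22.0 GB of 22 GB" in the log)
24576 MiB - 22528 MiB = 2048 MiB of headroom left on the node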
I am aware that I need to tune spark.yarn.executor.memoryOverheadFactor to make room for the large overhead that this many partitions creates. Hopefully that is the last change needed on that front.
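As a sketch of what I have in mind (the values are illustrative, chosen to keep memory * (1 + factor) just under the yarn.nodemanager.resource.memory-mb = 24576 limit):

spark.executor.memory = 19660M
spark.yarn.executor.memoryOverheadFactor = 0.25

That keeps the container at 19660 * 1.25 ≈ 24575 MiB while growing the overhead from roughly 3.5 GB to roughly 4.8 GB per executor.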
Shuffle timeouts
20/07/01 15:59:39 ERROR TransportChannelHandler: Connection to ip-10-102-116-184.ec2.internal/10.102.116.184:7337 has been quiet for 600000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
20/07/01 15:59:39 ERROR TransportResponseHandler: Still have 8 requests outstanding when connection from ip-10-102-116-184.ec2.internal/10.102.116.184:7337 is closed
20/07/01 15:59:39 ERROR OneForOneBlockFetcher: Failed while starting block fetches
I have already adjusted this timeout as follows:

spark.network.timeout = 600
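One related pairing I am keeping an eye on (my own assumption, not something from the logs): spark.executor.heartbeatInterval has to stay well below spark.network.timeout, so when raising the timeout it is worth pinning both explicitly:

spark.network.timeout = 600s
spark.executor.heartbeatInterval = 60s    (must be significantly less than spark.network.timeout)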
I could raise spark.network.timeout further in the conf and quiet these errors by simply waiting longer, but I would rather bring down the Shuffle Read Blocked Time, which currently ranges from 1 minute to 30 minutes. Is there a way to increase the rate of communication between nodes?
I have tried tuning the following settings, but nothing seems to speed this up:

spark.reducer.maxSizeInFlight = 512m
spark.shuffle.io.numConnectionsPerPeer = 5
spark.shuffle.io.backLog = 128
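For completeness, two fetch-retry settings I have not touched yet (values illustrative; the defaults are from the Spark docs). They let a stalled fetch retry instead of sitting on a dead connection:

spark.shuffle.io.maxRetries = 10    (default 3)
spark.shuffle.io.retryWait = 30s    (default 5s; wait between retry attempts)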
What do I need to tune to bring down the Shuffle Read Blocked Time on AWS EMR?
1 Answer
For the executors, follow the guidance here; it fixed the problem for us: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/

The rationale it gives covers this exact error: "Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75 GB of memory.
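The error's own hint ("Consider boosting spark.yarn.executor.memoryOverhead") points the same way. A minimal sketch using an absolute overhead value rather than a factor (the numbers are illustrative, chosen to fill the 24576 MiB node allocation exactly):

spark.executor.memory = 18g
spark.yarn.executor.memoryOverhead = 6144

18432 MiB + 6144 MiB = 24576 MiB, i.e. the whole YARN allocation on an r5.xlarge, with roughly 2.5 GB more overhead than the 0.1875 factor provides.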
To fix the shuffle timeouts, try increasing storage (EBS volumes).
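A sketch of what that could look like at cluster-creation time, assuming the EMR CLI's --instance-groups JSON (the volume sizes are illustrative; shuffle data spills to the local disks YARN manages, so more and larger volumes mean more spill space and more aggregate disk bandwidth):

{
  "InstanceGroupType": "CORE",
  "InstanceType": "r5.xlarge",
  "EbsConfiguration": {
    "EbsBlockDeviceConfigs": [
      { "VolumeSpecification": { "VolumeType": "gp2", "SizeInGB": 500 }, "VolumesPerInstance": 2 }
    ]
  }
}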