If the first reduce attempt fails (network connectivity problem), subsequent reduce attempts (retries) fail because the output file already exists

xkrw2x1b · posted 2021-05-30 · Hadoop

My MapReduce job on Amazon EMR fails because if the first attempt does not manage to copy its result to S3, a file (probably a partial one) gets created, and the subsequent reduce attempts then refuse to write to a file that already exists.
Log from the first attempt:

2014-11-30 06:56:19,774 INFO [main] com.amazonaws.latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: null; Request ID: removed), S3 Extended Request ID: removed=], ServiceName=[Amazon S3], AWSErrorCode=[null], AWSRequestID=[removed], ServiceEndpoint=[https://devel.rui.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=0, ClientExecuteTime=[130.087], HttpRequestTime=[118.72], HttpClientReceiveResponseTime=[32.585], RequestSigningTime=[0.646], HttpClientSendRequestTime=[0.835], 
2014-11-30 06:56:19,803 INFO [main] com.amazonaws.latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: null; Request ID: removed), S3 Extended Request ID: 1removed=], ServiceName=[Amazon S3], AWSErrorCode=[null], AWSRequestID=[removed], ServiceEndpoint=[https://removed.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[27.899], HttpRequestTime=[26.898], HttpClientReceiveResponseTime=[9.405], RequestSigningTime=[0.559], HttpClientSendRequestTime=[1.016], 
2014-11-30 06:56:19,939 INFO [main] com.amazonaws.latency: StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[removed], ServiceEndpoint=[https://removedi.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[127.219], HttpRequestTime=[20.791], HttpClientReceiveResponseTime=[15.467], RequestSigningTime=[0.391], ResponseProcessingTime=[82.617], HttpClientSendRequestTime=[0.955], 
2014-11-30 06:56:19,999 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords

Log from a retry (all retries look the same):

RequestSigningTime=[0.663], ResponseProcessingTime=[12.466], HttpClientSendRequestTime=[0.832], 
2014-11-30 07:23:56,526 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child :

java.io.IOException: File already exists: s3n://removed/removed/part-r-00005.gz
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:615)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:891)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:788)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:169)
    at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:548)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:622)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

Interestingly, if I open the part file (part-r-00005.gz), it has content in it, in the expected format.
Any ideas how to solve this issue (and how to do it)? Either a) make the job tolerate longer delays (e.g., increase the timeouts), or b) have the retry delete the existing file if it is already there (a sketch of option (b) follows below).
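For option (b), a possible sketch (not from the original post) is a custom output format that deletes a leftover part file before the parent class re-creates it. OverwritingTextOutputFormat is a hypothetical name, and whether getDefaultWorkFile resolves to the same S3 path the EMR committer actually writes to depends on how the output committer is configured:

    // Hypothetical sketch for option (b): delete a leftover part file
    // before the standard TextOutputFormat re-creates it on a retry.
    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class OverwritingTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
        @Override
        public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
                throws IOException, InterruptedException {
            // ".gz" matches the compressed output seen in the logs; the real
            // extension depends on the configured compression codec.
            Path file = getDefaultWorkFile(context, ".gz");
            FileSystem fs = file.getFileSystem(context.getConfiguration());
            if (fs.exists(file)) {
                // Remove the partial file left behind by the failed attempt.
                fs.delete(file, false);
            }
            return super.getRecordWriter(context);
        }
    }

The job would then register it with job.setOutputFormatClass(OverwritingTextOutputFormat.class).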


lzfw57am · answer 1#

You can modify the job so that it writes its output to a temporary directory, named with a job ID or timestamp to guarantee uniqueness, and then, once processing has completed, move the contents to the desired output location. That way, if something goes wrong after partial output has been written, the desired output directory is not affected. It also means you will never accidentally read the partial output of a failed job.
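A minimal sketch of this approach, assuming a standard Hadoop 2.x MapReduce driver (the bucket and path names are illustrative, not from the question):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TempDirDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "write-to-temp-then-move");
            // ... set jar, mapper, reducer, key/value classes, input paths ...

            // Write to a unique temporary location; the timestamp (or job ID)
            // guarantees a retry of the whole job never collides with partial
            // output from an earlier failed run.
            Path finalOut = new Path("s3n://bucket/output");  // illustrative
            Path tmpOut = new Path("s3n://bucket/tmp/output-" + System.currentTimeMillis());
            FileOutputFormat.setOutputPath(job, tmpOut);

            if (job.waitForCompletion(true)) {
                // Publish the results only after the whole job has succeeded.
                FileSystem fs = finalOut.getFileSystem(conf);
                fs.rename(tmpOut, finalOut);
            }
        }
    }

Keep in mind that on S3 a rename is implemented as a copy followed by a delete, so the final move is not atomic and can take a while for large outputs.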
