Hadoop S3 driver fails with 403 after several successful requests

bq9c1y66 · posted 2021-05-29 · in Hadoop

I am using the AWS S3 driver with Apache Nutch to upload files from an EC2 instance to an S3 bucket. The EC2 instance has an IAM policy attached that allows access to the bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::storage"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:GetObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::storage/*"
      ]
    }
  ]
}

It works fine at first: Nutch parses segments and writes them to the S3 bucket, but after a few segments it fails with the following error:

com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: ...
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1507)
        at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:143)
        at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:131)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.copy(CopyMonitor.java:189)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:134)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[ERROR] org.apache.nutch.crawl.CrawlDb: CrawlDb update job did not succeed, job status:FAILED, reason: NA
Exception in thread "main" java.lang.RuntimeException: CrawlDb update job did not succeed, job status:FAILED, reason: NA
        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:142)
        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:83)

I assume the IAM policy is fine, since Nutch does manage to upload some segments before failing.
My AWS/Hadoop-related configuration is:

com.amazonaws.services.s3.enableV4=true
fs.s3a.endpoint=s3.us-east-2.amazonaws.com
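For context, `com.amazonaws.services.s3.enableV4` is a JVM system property understood by the AWS SDK, not a Hadoop configuration key, while `fs.s3a.endpoint` is a regular Hadoop key. A minimal sketch of how the two settings are typically wired up when launching a Hadoop/Nutch job (the use of `HADOOP_OPTS` and the generic-options `-D` flag are assumptions about the deployment, not details from the question):

```shell
# JVM system property for the AWS SDK to force Signature Version 4 signing.
# It is not read from core-site.xml, so it is usually passed via HADOOP_OPTS:
export HADOOP_OPTS="$HADOOP_OPTS -Dcom.amazonaws.services.s3.enableV4=true"

# fs.s3a.endpoint, by contrast, is a Hadoop key: it can go in core-site.xml
# or be supplied on the command line through the generic options parser:
#   hadoop jar <job.jar> <class> -Dfs.s3a.endpoint=s3.us-east-2.amazonaws.com ...
```

V4 signing matters here because newer regions such as us-east-2 accept only Signature Version 4 requests.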

Why am I getting this error, and how do I fix it?
Update: I run Nutch programmatically (not from the CLI) on a single EC2 machine, not a Hadoop cluster. To access S3 I use the s3a filesystem (the output path is s3a://mybucket/data). The Hadoop version is 2.7.3 and the Nutch version is 1.15.

Answer from pcrecxhr:

The error above can occur when running in local mode, as a side effect of S3's inconsistency.
Because S3 provides only eventual consistency for read-after-write, there is no guarantee that a file will be visible in the bucket when it is listed or renamed, even if it was written only moments before.
The Hadoop team also provides a troubleshooting guide: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md
If your use case requires running in local mode, I suggest the following:
write the files to a local folder, then upload them with `aws s3 sync local-folder s3://bucket-name --region region-name --delete`
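The workaround above can be sketched as a small script. The folder path, bucket name, and region are placeholders, not values taken from the question:

```shell
#!/bin/sh
# Sketch of the suggested workaround: write Nutch output locally,
# then mirror the whole folder to S3 in one pass.

LOCAL_OUT=/tmp/nutch-output   # placeholder local output folder

# ... run the Nutch crawl/update against $LOCAL_OUT instead of s3a:// ...

# --delete removes objects from the bucket that no longer exist locally,
# so the bucket ends up as an exact mirror of the folder.
aws s3 sync "$LOCAL_OUT" s3://bucket-name --region us-east-2 --delete
```

This sidesteps the listing/rename behavior of s3a entirely, since the upload is a single bulk copy after all local writes have finished.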
