I am uploading files from an EC2 instance to an S3 bucket using the AWS S3 driver with Apache Nutch. The EC2 instance has an IAM policy attached that allows access to the S3 bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::storage"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:GetObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::storage/*"
      ]
    }
  ]
}
It works fine at first: Nutch parses segments and writes them to the S3 bucket, but after a few segments it fails with a SignatureDoesNotMatch error:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: ...
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1507)
at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:143)
at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:131)
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.copy(CopyMonitor.java:189)
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:134)
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[ERROR] org.apache.nutch.crawl.CrawlDb: CrawlDb update job did not succeed, job status:FAILED, reason: NA
Exception in thread "main" java.lang.RuntimeException: CrawlDb update job did not succeed, job status:FAILED, reason: NA
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:142)
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:83)
I assume the IAM policy is fine, since Nutch manages to upload a few segments before it fails.
My AWS/Hadoop-related configuration is:
com.amazonaws.services.s3.enableV4=true
fs.s3a.endpoint=s3.us-east-2.amazonaws.com
Why am I getting this error, and how can I fix it?
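For reference, the two settings above normally live in different places: fs.s3a.endpoint is a Hadoop configuration key (e.g. in core-site.xml), while com.amazonaws.services.s3.enableV4 is a JVM system property read by the AWS SDK, not a Hadoop key. A sketch, mirroring the values from the question:

```xml
<!-- core-site.xml (sketch; endpoint value taken from the question) -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.us-east-2.amazonaws.com</value>
  </property>
</configuration>
```

When running Nutch programmatically, the V4-signing property can be set before the S3A filesystem is first created, e.g. System.setProperty("com.amazonaws.services.s3.enableV4", "true"), or passed to the JVM as -Dcom.amazonaws.services.s3.enableV4=true.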
UPDATE: I run Nutch programmatically (not from the CLI) on a single EC2 machine (not a Hadoop cluster). To access S3 I use the s3a filesystem (the output path is s3a://mybucket/data). The Hadoop version is 2.7.3 and the Nutch version is 1.15.
1 Answer
The error above occurs when running in local mode because of S3's consistency side effects.
Since S3 provides only eventual consistency for read-after-write, there is no guarantee that a file will be present in the bucket when it is listed or renamed, even if it was written just moments before.
The Hadoop team also provides a troubleshooting guide: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md
If your use case requires running in local mode, I suggest the following:
1. Write the files to a local folder, e.g. local-folder
2. Upload that folder with aws s3 sync local-folder s3://bucket-name --region region-name --delete
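Concretely, the two steps might look like the sketch below. The bucket name is a placeholder, the region is taken from the question's endpoint, and the sync step assumes the AWS CLI is installed and the instance role grants the permissions shown in the question:

```shell
#!/bin/sh
set -e

# 1) Point Nutch/Hadoop at a local output path instead of s3a://...,
#    e.g. file:///path/to/local-folder, so segments land on local disk.
mkdir -p local-folder
printf 'example segment data\n' > local-folder/part-00000  # stand-in for Nutch output

# 2) Mirror the folder to the bucket. --delete removes remote objects
#    that no longer exist locally, keeping the two sides in sync.
if command -v aws >/dev/null 2>&1; then
  aws s3 sync local-folder s3://bucket-name --region us-east-2 --delete
else
  echo "aws CLI not installed; skipping sync step"
fi
```

Because the sync happens in one CLI invocation after all local writes have finished, it sidesteps the list/rename races that the local-mode committer runs into against S3.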