pyspark: Error when reading S3 data with Spark using Iceberg: "The bucket you are attempting to access must be addressed using the specified endpoint"

Posted by ruarlubt 12 months ago in Spark

I am running into a problem when trying to read data from an S3 bucket with Spark (PySpark) and Iceberg. I have configured my Spark application with the following settings:

...
conf.set("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1,org.apache.iceberg:iceberg-spark-extensions-3.4_2.12:1.3.1,software.amazon.awssdk:bundle:2.20.145,software.amazon.awssdk:url-connection-client:2.20.145")
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
conf.set("spark.sql.catalog.spark_catalog.type", "hive")
conf.set("spark.sql.catalog.spark_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
...

However, I always get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o82.sql.
: software.amazon.awssdk.services.s3.model.S3Exception: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint. (Service: S3, Status Code: 301, Request ID: ..., Extended Request ID: ...)

I suspect the error is caused by a misconfigured region, and I tried to fix it by setting the following Spark configuration properties:

conf.set("spark.hadoop.fs.s3.region", "eu-west-1")
conf.set("spark.hadoop.fs.s3a.region", "eu-west-1")
conf.set("spark.hadoop.fs.s3n.region", "eu-west-1")
conf.set("spark.sql.catalog.spark_catalog.hadoop.fs.s3.region", "eu-west-1")
conf.set("spark.sql.catalog.spark_catalog.hadoop.fs.s3a.region", "eu-west-1")
conf.set("spark.sql.catalog.spark_catalog.hadoop.fs.s3n.region", "eu-west-1")

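However, this does not seem to solve the problem. Can anyone help me understand what causes this error, and how to configure the Spark application correctly so that it accesses the S3 bucket through the right endpoint?
Thank you for your help.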
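
A possible direction, offered as a sketch rather than a confirmed fix: the exception comes from `software.amazon.awssdk`, meaning `S3FileIO` talks to S3 through the AWS SDK v2 client and not through Hadoop's S3A connector, so the `fs.s3*.region` properties above are most likely never read. The SDK resolves its region from the `aws.region` JVM system property, the `AWS_REGION` environment variable, or the AWS profile. The helper below (`s3fileio_region_settings` is a hypothetical name) collects settings that cover those routes. The `client.region` catalog property is an assumption about Iceberg's AWS client properties and should be checked against the documentation for version 1.3.1.

```python
def s3fileio_region_settings(region, catalog="spark_catalog"):
    """Build Spark settings that pass an AWS region to the AWS SDK v2 client.

    Hypothetical helper for illustration; it only builds a dict and does not
    touch Spark, so it can be inspected before being applied.
    """
    jvm_flag = f"-Daws.region={region}"
    return {
        # Assumption: Iceberg catalog property read by its AWS client factory.
        f"spark.sql.catalog.{catalog}.client.region": region,
        # AWS SDK v2 reads the aws.region system property on each JVM.
        # Note: these overwrite any existing extraJavaOptions values.
        "spark.driver.extraJavaOptions": jvm_flag,
        "spark.executor.extraJavaOptions": jvm_flag,
        # Spark forwards spark.executorEnv.* as environment variables to executors.
        "spark.executorEnv.AWS_REGION": region,
    }


settings = s3fileio_region_settings("eu-west-1")

# Applying them to the SparkConf from the question, before the session starts:
# for key, value in settings.items():
#     conf.set(key, value)
```

The driver-side JVM option only takes effect if it is set before the driver JVM is launched. Exporting `AWS_REGION=eu-west-1` in the shell that starts the application is a simpler alternative for the driver.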

pyspark

Source: https://stackoverflow.com/questions/77099097/error-when-reading-s3-data-with-spark-using-iceberg-the-bucket-you-are-attempt